Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Meyer, Fernando; Kurtz, Stefan; Beckstette, Michael

doi:10.1186/1471-2105-14-226

Methodology article
Open access
Published: 17 July 2013

Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Fernando Meyer¹,
Stefan Kurtz¹ &
Michael Beckstette¹

BMC Bioinformatics volume 14, Article number: 226 (2013) Cite this article

5429 Accesses
6 Citations
12 Altmetric
Metrics details

Abstract

Background

It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequence-structure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequence-structure alignment problem.

Results

We present new fast index-based and online algorithms for approximate matching of RNA sequence-structure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semi-global alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a user-defined sequence-structure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of non-matching substrings. Our new index-based methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequence-structure patterns, we use fast algorithms for the local or global chaining of approximate sequence-structure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new index-based algorithm achieves a speedup of factor 560.

Conclusions

The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented index-based algorithms, allows for the first time approximate matching of RNA sequence-structure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented open-source software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.uni-hamburg.de/ralignator.

Background

Due to their participation in several important molecular-biological processes, ranging from passive carriers of genetic information (tRNAs) over regulatory functions (microRNAs) to protein-like catalytic activities (Riboswitsches), non-coding RNAs (ncRNAs) are of central research interest in molceular biology [1]. NcRNAs, although synthesized as single-stranded molecules, present surprising complexity by being able to base pair with themselves and fold into numerous different structures. It is to a large extent the structure that allows them to interact with other molecules and hence to carry out various biological functions. This can also be observed in families of functionally related ncRNAs like the ones compiled in the Rfam database [2]. Here members of a family often share only few sequence features, but share by far more specific structural and functional properties. Consequently, methods for effective RNA homology search (i.e. finding new members of an RNA family) cannot rely on sequence similarity alone, but also have to take structural similarity into account.

In this paper, we address the problem of searching nucleotide databases for occurrences of RNA family members. Since for this task it is not sufficient to rely on pure sequence alignment, we briefly review search methods that employ sequence and structure information.

There exist various general sequence-structure alignment tools which determine structural similarities that are too diverse to be alignable at the sequence level. Such tools can roughly be divided into two classes. The first class consists of tools that align RNAs with given structures or determine a common structure during the alignment process. Tools like MARNA[3] and RNAforester[4] require an a priori known secondary structure for both input RNAs. However, they suffer from the low quality of secondary structure prediction. Addressing this problem, other tools implement variations of the Sankoff algorithm [5], which provides a general but computationally demanding solution to the problem of simultaneously computing an alignment and the common secondary structure of the two aligned sequences. Unfortunately, even tools with improved running times using variations of this algorithm (LocARNA[6], Foldalign[7, 8], Dynalign[9, 10]) or heuristics [11] are simply not fast enough for rapid searches in large nucleotide databases. Hence, in a second class we identify more specialized tools for searching RNA families in nucleotide databases. These tools use a model or motif descriptors (i.e. patterns) defining consensus sequence and secondary structure properties of the families to be searched for. For example, Infernal[12] and RSEARCH[13] infer a covariance model from a given multiple sequence alignment annotated with structure information. This model can then be used to search sequence databases for new family members. Another tool, ERPIN[14] is also based on automatically generated statistical secondary profiles. Although being very sensitive in RNA homology search, in particular Infernal and RSEARCH suffer from high computational demands. An alternative are tools like RNAMotif[15], RNAMOT[16], RNABOB[17], RNAMST[18], PatScan[19], PatSearch[20], or Palingol[21]. These methods use user-defined motif descriptors created from a priori knowledge about the secondary structure of the described RNA family. Another tool, Locomotif[22], generates a thermodynamic matcher program from a pattern drawn interactively by the user via a graphical interface. Although these tools based on motif descriptors are faster than the previously mentioned tools, they have a running time that scales at least linearly with the size of the target sequence database. This makes their application to large databases challenging. Previously, we addressed this problem by presenting Structator[23], an ultra fast index-based bidirectional matching tool that achieves sublinear running time by exploiting base pair complementarity constraints for search space reduction.

Apart from running time constraints, another major disadvantage of all current tools that search for sequence-structure patterns is their limited capacity to find approximate matches to the patterns. Although variability in length of pattern elements is often allowed, this is constrained to certain pattern positions that must be specified by the user. This limitation also holds for our Structator tool. Also, variations (insertions, deletions, or replacements) in the sequence that lead to small structural changes, such as the breaking of a base pair, are not supported. This often hampers the creation of patterns that are specific but generalized enough to match all family members. An algorithm presented in [24] only partially alleviates this problem by finding approximate matches of a helix in a genome allowing edit operations on single bases, but not on the structure.

To overcome these issues, we present new fast index-based and online algorithms for approximate matching of sequence-structure patterns, all implemented in an easy-to-use software package. Given one or more patterns describing any (branching, non-crossing) RNA secondary structure, our algorithms compute alignments of the complete patterns to substrings of the target sequence, i.e. semi-global alignments, taking sequence and structure into account. For this, they apply a full set of edit operations on single bases and base pairs. Matches are reported for alignments whose sequence-structure edit cost and number of insertions and deletions do not exceed user-defined thresholds. Our most basic algorithm is a scanning variant of the dynamic programming algorithm for global pairwise sequence-structure alignment of Jiang et al.[25], for which no implementation was available. Because its running time is too large for database searches on a large scale, we present accelerated online and index-based algorithms. All our new algorithms profit from a new computing scheme to optimally reuse the required dynamic programming matrices and a technique to save computation time by determining as early as possible whether a substring of the target sequence can contain a match. In addition, our index-based algorithms employ the suffix array data structure compiled from the search space. This further reduces the running time.

As in [23], we also support the description of an RNA molecule by multiple ordered sequence-structure patterns. In this way, the molecule’s secondary structure is decomposed into a sequence of substructures described by independent sequence-structure patterns. These patterns are efficiently aligned to the target sequences using one of our new algorithms and the results are combined with fast global and local chaining algorithms [23, 26]. This allows a better balancing of running time, sensitivity, and specificity compared to searching with a single long pattern describing the complete sequence and secondary structure.

Before we describe our algorithms, we formalize the approximate search problem with the involved sequence-structure edit operations. Then we present, step by step, two efficient online and two index-based matching algorithms. We proceed with a short review of the approach for computing chains of matches. Finally, we present several benchmark experiments.

Methods

Preliminaries

An RNA sequence S of length n = |S| over the set of bases $A = {A, C, G, U}$ is a juxtaposition of n bases from $A$ . S[i], 1 ≤ i ≤ n, denotes the base of S at position i. Let ε denote the empty sequence, the only sequence of length 0. By $A^{n}$ we denote the set of sequences of length n ≥ 0 over $A$ . The set of all possible sequences over $A$ including the empty sequence ε is denoted by $A^{*}$ .

For a sequence S = S[1] S[2] … S[n] and 1 ≤ i ≤ j ≤ n, S[i..j] denotes the substring S[i] S[i + 1] … S[j] of S. For S = u v, u and $v \in A^{*}$ , u is a prefix of S, and v is a suffix of S. The k-th suffix of S starts at position k, while the k-th prefix of S ends at k. For 1 ≤ k ≤ n, S_k denotes the k-th suffix of S. For stating the space requirements of our index structures, we assume that |S| < 2³², so that sequence positions and lengths can be stored in 4 bytes.

The secondary structure of an RNA molecule is formed by Watson-Crick pairing of complementary bases and also by the slightly weaker wobble pairs. We say that two bases $(c, d) \in A \times A$ are complementary and can form a base pair if and only if $(c, d) \in C = {(A, U), (U, A), (C, G), (G, C), (G, U), (U, G)}$ . If two bases a and b form a base pair we also say that there exists an arc between a and b. A non-crossing RNA structure $\hat{R}$ of length m is a set of base pairs (i,j), 1 ≤ i < j ≤ m, stating that the base at position i pairs with the base at position j, such that for all $(i, j), (i^{'}, j^{'}) \in \hat{R} : i < i^{'} < j^{'} < j$ or i^′ < i < j < j^′ or i < j < i^′ < j^′ or i^′ < j^′ < i < j. A standard notation for $\hat{R}$ is a structure string R over the alphabet {.,(,)} such that for each base pair $(i, j) \in \hat{R}$ , R[i] = ( and R[j] = ), and R[r] = . for positions r, 1 ≤ r ≤ m, that do not occur in any base pair of $\hat{R}$ , i.e. r ≠ i and r ≠ j for all $(i, j) \in \hat{R}$ .

Let Φ = {R, Y, M, K, W, S, B, D, H, V, N} be a set of characters. According to the IUPAC definition, each character in Φ denotes a specific character class $φ (x) \subseteq A$ . Each character $x \in A$ can be seen as a character class φ(x) = {x} of exactly one element. A sequence pattern is a sequence $P \in {(A \cup Φ)}^{*}$ . An RNA sequence-structure pattern (RSSP) $Q = (P, R)$ of length m is a pair of a sequence pattern P and a structure string R, both of length m. With $Q [i..j]$ we denote the RSSP region (P[i..j],R[i..j]).

Approximate matching of RNA sequence-structure patterns

To find in a long RNA sequence S approximate matches of an RSSP $Q$ describing a part of an RNA molecule, we compute alignments of the complete $Q$ and substrings of S considering edit operations for unpaired bases and base pairs. That is, we compute semi-global alignments simultaneously obtaining the sequence-structure edit distance of $Q$ and substrings of S.

We define the alignment of $Q$ and a substring S[p..q], 1 ≤ p ≤ q ≤ n, as set A = A_match ⊎ A_gap. The set A_match ⊆ [1..m] × [p..q] of match edges satisfies that, for all different (k,l),(k^′,l^′) ∈ A_match, k > k^′ implies l > l^′. The set A_gap of gap edges is defined as ${(x, -) | x \in [1 ..m] \land ∄ y, (x, y) \in A_{match}} \cup {(-, y) | y \in [p..q] \land ∄ x, (x, y) \in A_{match}}$ . See Figure 1 for an example of a semi-global alignment and associated alignment edges. The alignment cost is based on a sequence-structure edit distance. The allowed edit operations on unpaired bases P[k] and S[l], 1 ≤ k ≤ m, p ≤ l ≤ q, are base mismatch (match), with cost ω_m (zero), which occurs if there is an edge (k,l) ∈ A_match and S[l] ∉ φ(P[k]) (S[l] ∈ φ(P[k])), and base deletion (insertion), with cost ω_d, which occurs if (k,−) ∈ A_gap ((−,l) ∈ A_gap). The possible edit operations on base pairs were first introduced by Jiang et al.[25] and are defined as follows. Let (k₁,k₂) be a base pair in $\hat{R}$ and l₁ and l₂, p ≤ l₁ < l₂ ≤ q, be positions in S.

An arc breaking, with cost ω_b, occurs if (k₁,l₁) ∈ A_match and (k₂,l₂) ∈ A_match but bases S[l₁] and S[l₂] are not complementary. An additional base mismatch cost ω_m is caused if S[l₁] ∉ φ(P[k₁]) and another if S[l₂] ∉ φ(P[k₂]). To give an example, consider the semi-global alignment in Figure 1. RSSP $Q$ contains base pair $(5, 9) \in \hat{R}$ and there exist edges (5,11) ∈ A_match and (9,16) ∈ A_match but S[11] = G and S[16] = G are not complementary. We note a difference between our definition and the definition of Jiang et al., where both aligned sequences are annotated with structure information. There, an arc breaking occurs if bases S[l₁] and S[l₂] are annotated as unpaired in addition to the condition of existing edges (k₁,l₁) ∈ A_match and (k₂,l₂) ∈ A_match. Hence, because in our case sequence S has no structure annotation, our definition is based on the complementarity of bases S[l₁] and S[l₂].
An arc altering, with cost ω_a, occurs if either (1) (k₁,l₁) ∈ A_match and (k₂,−) ∈ A_gap or (2) (k₂,l₂) ∈ A_match and (k₁,−) ∈ A_gap. Each case induces an additional base mismatch cost ω_m if S[l₁] ∉ φ(P[k₁]) or S[l₂] ∉ φ(P[k₂]). As an example, observe in the alignment shown in Figure 1 that there exist a base pair $(11, 16) \in \hat{R}$ and edges (11,−) ∈ A_gap and (16,21) ∈ A_match.
An arc removing, with cost ω_r, occurs if (k₁,−) ∈ A_gap and (k₂,−) ∈ A_gap. As an example, observe in the alignment in Figure 1 that there exist a base pair $(3, 19) \in \hat{R}$ and edges (3,−) ∈ A_gap and (19,−) ∈ A_gap.

With this set of edit operations on the sequence and structure we can now define the cost of the alignment of $Q$ and S[p..q] as

\begin{array}{l} dist (Q, S [p..q]) = min {dis t_{A} (Q, S [p..q]) | A \\ is an alignment of Q and S [p..q]} \end{array}

(1)

where

\begin{array}{l} \begin{array}{lcl} dis t_{A} (Q, S [p..q]) = \\ \sum_{(k, l) \in A, R [k] = ., S [l] \notin φ (P [k])} ω_{m} & base mismatch \\ + & \sum_{(k, -) \in A, R [k] = .} ω_{d} & base deletion \end{array} \\ \begin{array}{lcl} + & \sum_{(-, l) \in A} ω_{d} & base insertion \\ + & \sum_{(k_{1}, k_{2}) \in \hat{R}, (k_{1}, l_{1}) \in A, (k_{2}, l_{2}) \in A, (S [l_{1}], S [l_{2}]) \notin C} ω_{b} & arc breaking \\ + & \sum_{(k_{1}, k_{2}) \in \hat{R}, (k_{1}, l_{1}) \in A, (k_{2}, -) \in A} ω_{a} & arc altering \\ + & \sum_{(k_{1}, k_{2}) \in \hat{R}, (k_{2}, l_{2}) \in A, (k_{1}, -) \in A} ω_{a} & arc altering \\ + & \sum_{(k_{1}, k_{2}) \in \hat{R}, (k_{1}, -) \in A, (k_{2}, -) \in A} ω_{r} & arc removing . \end{array} \end{array}

(2)

An alignment A of minimum cost between $Q$ and S[p..q] is an optimal alignment of $Q$ and S[p..q].

In practice, one is often interested in finding substrings of an RNA sequence S having a certain degree of similarity to a given RSSP $Q$ on both the sequence and structure levels. Therefore, we are only concerned about optimal alignments of $Q$ and substrings S[p..q] with up to a user-defined sequence-structure edit distance and a limited number of allowed insertions and deletions (indels). More precisely:

the cost $dist (Q, S [p..q])$ should not exceed a given threshold $K$ , and
the number of indels in the alignment should be at most d.

Thus, the approximate search problem for finding occurrences of an RSSP $Q$ in S, given user-defined thresholds $K$ and d, is to report all intervals [p..q] such that

dist (Q, S [p..q]) \leq K and m - d \leq | S [p..q] | \leq m + d \leq n .

(3)

We call every substring S[p..q] satisfying Equation (3) a match of $Q$ in S. In the subsequent sections we present algorithms for searching for matches of an RSSP $Q$ in a sequence S.

Online approximate RNA database search for RSSPs: ScanAlign

A straightforward algorithm to search for approximate matches of an RSSP $Q$ in an RNA sequence S consists of sliding a window of length m^′ = m + d along S while computing $dist (Q, S [p..q])$ for 1 ≤ p ≤ q ≤ n and q − p + 1 = m^′. We note that, although the length of a match can vary in the range m − d to m + d, to find matches of all possible lengths it suffices to slide a window of length m^′ along S corresponding to substrings S[p..q]. This holds because the alignment to a window of length m^′ entails all possible alignments with up to d allowed indels. In the following we present a dynamic programming algorithm computing $dist (Q, S [p..q])$ for every window S[p..q]. Our recurrences are derived from the algorithm for global pairwise sequence-structure alignment of Jiang et al.[25], i.e. an algorithm for aligning sequences of similar lengths. Although Jiang’s algorithm supports the sequence-structure edit operations described above, we emphasize that it is not suitable for computing semi-global alignments, which is what we are interested in.

We begin the description of our algorithm by defining three functions required by the dynamic programming recurrences. Let T = S[p..q].

1.
For computing base match and mismatch costs for positions i and j of the RSSP $Q = (P, R)$ and substring T, respectively, we define a function $χ : N \times N \to {0, 1}$ as:
$\begin{array}{l} χ (i, j) = \{\begin{array}{l} 0 & if T [j] \in φ (P [i]) & (base match) \\ 1 & otherwise. & (base mismatch) \end{array} \end{array}$
(4)
2.
To determine whether an arc breaking operation can occur, we must also be able to check for base complementarity at positions i and j of T. Therefore, we define a function $comp : N \times N \to {0, 1}$ as:
$\begin{array}{l} comp (i, j) = \{\begin{array}{l} 0 & if (T [i], T [j]) \in C & (complementary) \\ 1 & otherwise. & (not complementary) \end{array} \end{array}$
(5)
3.
For determining the correct row (of the dynamic programming matrices introduced below) where certain operation costs must be stored we introduce a function $row : N \to N$ defined as:
$row (i) = \{\begin{array}{l} i^{'} & if (i^{'}, i) \in \hat{R} and 1 < i^{'} < i < m and R [i + 1] \\ = . and R [i^{'} - 1] \neq ( \\ 0 & if (i, i^{'}) \in \hat{R} and R [i + 1] = . \\ i & otherwise . \end{array}$
(6)

Intuitively, function row satisfies the following: (1) given the right index i of a base pair (i^′,i), it returns the left index i^′ if (i^′,i) is preceded or followed by other structures; (2) given the left index i of a base pair (i,i^′), it returns 0 if the base at position i + 1 of $Q$ is unpaired; and (3) given any other position index i, it returns i itself.

Using these three functions, our algorithm determines the sequence-structure edit distance $dist (Q, T [1 .. m^{'}])$ by computing a series of m^′ + 1(m^′ + 1) × (m^′−k + 1) matrices D P_k, for 1 ≤ k ≤ m^′ + 1, such that $D P_{1} (row (m), m^{'}) = dist (Q, T [1 .. m^{'}])$ . We remark that D P_k(i,j) is not defined for every subinterval [i..j]. While the recurrences of Jiang’s algorithm are divided in four main cases, we present a simplified recurrence relation with only two main cases. In addition, we observe that we use only three indices for a matrix entry instead of four. Our recurrences are as follows.

1.
If i = 0 or R[i] =. (unpaired base), then
$\begin{array}{lcr} \begin{array}{l} D P_{k} (i, j) = \{\begin{array}{l} 0 & if i = 0 and j = 0 \\ D P_{k} (0, j - 1) + ω_{d} & if i = 0 and j > 0 \\ D P_{k} (row (i - 1), 0) + ω_{d} & if i > 0 and j = 0 \\ min \{\begin{array}{l} D P_{k} (row (i - 1), j) + ω_{d} \\ D P_{k} (i, j - 1) + ω_{d} \\ D P_{k} (row (i - 1), j - 1) + χ (i, j) ω_{m} \end{array}\} & if i > 0 and j > 0 \end{array} \end{array} \end{array}$
(7)
2.
If R[i] ≠. (paired base), then
1. (a)
  If R[i] =) where i forms base pair $(i^{'}, i) \in \hat{R}$ ,
  $\begin{array}{l} D P_{k} (i, j) = \{\begin{array}{l} D P_{k} (row (i - 1), 0) + ω_{r} & if j = 0 \\ min \{\begin{array}{l} D P_{k} (row (i - 1), j - 1) + χ (i, j + k) ω_{m} + ω_{a} \\ D P_{k + 1} (row (i - 1), j - 1) + χ (i^{'}, k) ω_{m} + ω_{a} \\ D P_{k} (row (i - 1), j) + ω_{r} \\ D P_{k} (i, j - 1) + ω_{d} \\ D P_{k + 1} (i, j - 1) + ω_{d} \\ D P_{k + 1} (row (i - 1), j - 2) + (χ (i, j + k) \\ + χ (i^{'}, k + 1)) ω_{m} + \\ comp (k + 1, j + k) ω_{b}, if j > 1 \end{array}\} & if j > 0 \end{array} \end{array}$
  (8)
2. (b)
  If (a) holds and either R[i ^′−1] =. or R[i ^′−1] =), compute in addition to Equation (8)
  $D P_{k} (row (i), j) = \{\begin{array}{l} D P_{k} (row (i^{'} - 1), 0) + D P_{k} (i, 0) & if j = 0 \\ min \{D P_{k} (row (i^{'} - 1), j^{'}) + D P_{k + j^{'}} (i, j - j^{'}) | 0 \leq j^{'} \leq j\} & if j > 0 \end{array}$
  (9)

A natural way to compute these DP matrices is top down, checking whether case 1, 2(a), or 2(b) applies, in this order. Due to the matrix dependencies in cases 2(a) and (b), the matrices need to be computed simultaneously.

Note that for all j, 1 ≤ j ≤ m^′, clearly $D P_{1} (row (m), j) = dist (Q, T [1 ..j])$ . Therefore all candidate matches shorter than m^′ beginning at position p are also computed in the computation of $dist (Q, T [1 .. m^{'}])$ . The following Lemma is another important contribution of this work and also the key for the development of an efficient algorithm.

Lemma 1

When sliding a window along S to compute $dist (Q, S [p..q])$ , 1 ≤ p ≤ q ≤ n, m^′ = q − p + 1 = m + d, a window shift by one position to the right requires to compute only column m^′−k + 1, i.e. the last column of matrices D P_k, 1 ≤ k ≤ m^′.

Proof

Let T[1..m^′] = S[p..q]. The computation of $dist (Q, T [1 .. m^{'}])$ requires to compute m^′ + 1 DP matrices, one for each suffix T_k of string T = T[1..m^′], 1 ≤ k ≤ m^′, and one for the empty sequence ε. As a result, it holds for every k that $dist (Q, T_{k}) = D P_{k} (row (m), m^{'})$ which is obtained as a by-product of the $dist (Q, T)$ computation. Because each substring T_l+1[1..m^′−l] = S[p + l..q], 0 ≤ l < m^′, only differs by its last character from S[p + l + 1..q + 1] which are suffixes of the window substring shifted by one position to the right, the lemma holds. □

Due to Lemma 1, our algorithm computes only the last column of the DP matrices for every shifted window substring (see the example in Figure 2) and just for the first window S[1..m^′] it computes every column. We call this algorithm ScanAlign. We note that during the reviewing process of this manuscript, Will et al.[27] submitted and published an algorithm for semi-global sequence-structure alignment of RNAs. As our method, this algorithm saves computation time by reusing entries of dynamic programming tables while scanning the target sequence.

Our ScanAlign algorithm has the following time complexity: computing D P_k(i,j) in cases 1 and 2(a) takes O(1) time and in case 2(b) it takes O(m^′) time. Now consider the two situations:

For the first computed window substring S[1..m^′], cases 1 and 2(a) require O(m m^′2) time in total and case 2(b) requires O(m m^′3) time in total. This leads to an overall time of O(m m^′3).
For one window shift, cases 1 and 2(a) require O(m m^′) time in total and case 2(b) requires O(m m^′2) time in total, leading to an overall time of O(m m^′2).

Since there are n − m^′−1 window shifts, the computation for all shifted windows takes O(m m^′2(n−m^′)) = O(m m^′2n) time. We observe that the time needed by ScanAlign to compute all window shifts reduces to O(m m^′n) if recurrence case 2(b) is not required. This is the case if the structure of $Q$ does not contain unpaired bases before a base pair constituting e.g. a left dangling end or left bulge.

Faster online alignment with early-stop computation: LScanAlign

Often, before completing the computation of the alignment between an RSSP $Q$ and a window substring S[p..q] of the searched RNA sequence, we can determine whether the cost of this alignment will exceed the cost threshold $K$ . By identifying this situation as early as possible, we can improve algorithm ScanAlign to skip the window, thus saving computation time and proceed with aligning the next window. The idea consists in checking, during the alignment computation, whether the cost of an already aligned region of $Q$ and a substring of S[p..q] exceeds $K$ . In such a case, the alignment cost of the complete $Q$ and S[p..q] will also exceed $K$ . In more detail, this works as follows.

We decompose the RSSP $Q$ into regions that can themselves represent a pattern, e.g. a stem-loop or unpaired region. A basic constraint is to not split base pairs to different regions.
We compute the alignment of a given initial RSSP region and a substring of the current window S[p..q], progressively extending the alignment to other regions.
If the cost of aligning an RSSP region to a substring of the window exceeds cost threshold $K$ , then the entire pattern cannot match the window. This means that the window can immediately be skipped.

Formally, a valid RSSP region $Q [x..y]$ , 1 ≤ x ≤ y ≤ m, satisfies exactly one of the following conditions.

1.
$Q [x..y]$ is a left dangling (unpaired) end of the pattern in 5^′ to 3^′ direction, i.e. x = 1. Alternatively, it is an unpaired region of maximal length such that position x − 1 forms a base pair $(x - 1, y^{'}) \in \hat{R}$ for some position y ^′ of $Q$ . Observe that no extension of $Q [x..y]$ by another unpaired position is possible. As an example, consider the green marked regions $Q [1 .. 2]$ , $Q [4 .. 4]$ , $Q [6 .. 8]$ , and $Q [12 .. 15]$ in Figure 3.
2.
Position y is unpaired and there is at least one base pair $(x^{'}, y^{'}) \in \hat{R}$ , x ≤ x ^′ < y ^′ < y. No extension of $Q [x..y]$ by another unpaired position is possible. As examples of regions under these requirements, see the regions in orange of the RSSP $Q$ in Figure 3, namely $Q [4 .. 10]$ , $Q [4 .. 18]$ , and $Q [1 .. 20]$ .
3.
$(x, y) \in \hat{R}$ is a base pair. For examples of such RSSP regions, see the regions in blue of the RSSP in Figure 3, namely $Q [5 .. 9]$ , $Q [11 .. 16]$ , and $Q [3 .. 19]$ .
4.
y forms a base pair $(x^{'}, y) \in \hat{R}$ where either R[x ^′−1] =. or R[x ^′−1] =), 1 ≤ x ≤ x ^′−1. In addition, x = 1 or $(x - 1, y^{'}) \in \hat{R}$ for some y ^′ > y. Examples of such RSSP regions are shown in red in Figure 3, i.e. regions $Q [4 .. 9]$ , $Q [4 .. 16]$ , and $Q [1 .. 19]$ .

Note that regions can be embedded in other regions but cannot partially overlap another.

Our progressive alignment computation of an RSSP $Q$ and a window substring of the searched RNA sequence S begins by considering only an in general small region of $Q$ embedded in another region. The computation is then extended to a surrounding region, e.g. from region $Q [6 .. 8]$ to $Q [5 .. 9]$ of the RSSP shown in Figure 3, until it entails the largest region surrounding all other regions, e.g. $Q [1 .. 20]$ of the same example. Formally, we elaborate the alignment computation as follows. Let T = T[1..m^′] be a window substring of length m^′ = m + d of S and d be the number of allowed indels. Pattern regions have the property that, for any region $Q [x..y]$ , computing $dist (Q [x..y], T)$ does not depend on any other region $Q [x^{'} .. y^{'}]$ for some y^′ < x and x^′ < y. Therefore, they can easily be sorted to indicate the order by which the rows of the DP matrices are computed. We observe that the top-down computation of the DP matrices, as described above, automatically sorts the regions and respects the dependency between rows. To obtain from the sorted regions the indices of the rows to be computed, we consider the condition satisfied by each region. The rows obtained according to each condition are computed according to one case of the recurrence. Given region $Q [x..y]$ identified by one of the four conditions this region satisfies, the following rows of the matrices have to be computed.

1.
All rows in the interval [x..y] are computed by Equation (7).
2.
One scans the structure of region $Q [x..y]$ from position y to position x until one finds a paired position y ^′. Then, all rows in the interval [y ^′ + 1..y] are computed by Equation (7).
3.
Row y is computed by recurrence (a) of Equation (8).
4.
Row row(y) is computed by recurrence (b) of Equation (8).

The sequential computation of the rows belonging to each region naturally leads to the computation of the entire alignment of $Q$ and sequence-structure edit distance $dist (Q, T)$ .

Our improvement of the ScanAlignalgorithm is based on the following two observations.

The standard dynamic programming algorithm for aligning two plain text sequences of lengths m and n requires an (m + 1) × (n + 1) matrix. Let i and j be indices of each of the matrix dimensions and a diagonal v be those entries defined by i and j such that j − i = v. Given that the cost of each edit operation is a positive value, the cost of the entries along a diagonal of the matrix are always non-decreasing [28].
Moreover, one indel operation implies that an optimal alignment path including an entry on diagonal v also includes at least one entry on diagonal v + 1 or v − 1. Now let v be the diagonal ending at the entry on the lower-right corner of the matrix and d be the number of allowed indels. One can stop the alignment computation as soon as all the entries of one row in the matrix and along diagonals v + d^′, −d ≤ d^′ ≤ d, exceed $K$ .

For our improvement of algorithm ScanAlign, based on the following Lemma, we define a diagonal for each RSSP region instead of only one for the entire matrices.

Lemma 2

Assume an RSSP $Q = (P, R)$ , a region $Q [x..y]$ of length l = y − x + 1, a window substring T[1..m^′] of the searched RNA sequence, a cost threshold $K$ , and number d of allowed indels. If for every d^′, −d ≤ d^′ ≤ min{d,x}, z ∈ {|d^′|−d,−|d^′|+d}, y + d^′ ≤ m^′, it holds that $dist (Q [x..y], T_{x + d^{'}} [1 ..l + z]) > k$ , then, for every d^′′, 0 ≤ d^′′ ≤ d, $dist (Q, T [1 .. m^{'} - d^{''}]) > k$ .

Proof

If the RSSP region $Q [x..y]$ originates from condition 1 or 2 (3 or 4) above, we define the entries on a diagonal e as those entries D P_k(i,j) (D P_k(r o w(y),j)), 1 ≤ k ± d ≤ m^′, such that j − i + offset = e, where offset = x − 1. Without loss of generality let d = 1. Assuming x − 1 > 0 and y + 1 ≤ m^′, this means that an optimal alignment of pattern $Q$ and substring T requires $Q [x..y]$ to align with:

T[x..y], T[x..y − 1], or T[x..y + 1], requiring for all three alignments the computation of $dist (Q [x..y], T_{x} [1 ..l + z])$ for z ∈ {0−1,0 + 1} = {−1,1};
T[x − 1..y − 1], requiring the computation of $dist (Q [x..y], T_{x - 1} [1 ..l + z])$ for z ∈ {|−1|−1,−|−1| + 1} = {0}; or
T[x + 1..y + 1], requiring the computation of $dist (Q [x..y], T_{x + 1} [1 ..l + z])$ for z ∈ {|1|−1,−|1| + 1} = {0}.

The alignments with T[x..y], T[x..y + 1], and T[x..y−1] end in matrix D P_x. The alignments with T[x−1..y−1] end in matrix D P_x−1, and the alignments with T[x + 1..y + 1] end in matrix D P_x+1. Every minimizing path obtained for the entire alignment of $Q$ and T can only include the entries on the diagonals e, e + 1, and/or e − 1 for the alignments with T[x..y], T[x..y + 1], and T[x..y − 1], and can only include the entries on diagonal e for the alignments with T[x − 1..y − 1] and T[x + 1..y + 1] because these substrings already imply alignments with one indel. As the sum of the cost of the edit operations on the minimizing path increases monotonically and there cannot be other minimizing paths due to the limited number of indels d, the lemma holds. □

Let $Q$ be an RSSP whose regions are sorted by the order of computation of their respective rows in the DP tables above, let d be the number of allowed indels, and T = T[1..m^′] be a window substring of the searched RNA sequence. Applying Lemma 2, we modify algorithm ScanAlign to compute the alignment of each region $Q [x..y]$ to substrings $T_{x + d^{'}}$ , −d ≤ d^′ ≤ min{d,x}, y + d^′ ≤ m^′, and progressively extend the alignment to other RSSP regions and substrings of T as long as $dist (Q [x..y], T_{x + d^{'}} [1 ..l + z]) \leq k$ , z ∈ {|d^′|−d,−|d^′| + d}, holds. That is, for each RSSP region, it determines the rows and recurrence case required for their computation according to conditions 1, 2, 3, or 4 above. Then, within each processed row i, it checks whether for at least one entry D P_k(i,j) on a possible minimizing path, i.e. on diagonals e^′, e − d ≤ e^′ ≤ e + d, DP_k(i,j) ≤ k. If no entry is below $K$ , it skips the alignment computation for all remaining RSSP regions and proceeds with aligning the next window. See Figure 2 for an example of the DP matrices of an alignment computation whose entries on a possible minimizing path are highlighted in yellow.

When scanning the searched RNA sequence, a window can be shifted before all DP matrices entries are computed. Hence, a direct application of Lemma 1 is no longer possible. To overcome this, we define an array Z in the range 1 to z, where z is the number of RSSP regions, and associate each region with an index r, 1 ≤ r ≤ z. Let p be the starting position of the window substring S[p..q] in the RNA sequence. We set Z[r] = p whenever all DP matrices rows and columns belonging to region r are computed. This occurs when the cost of aligning this region does not exceed cost threshold $K$ . Now, when aligning the same RSSP region r to a different window substring S[p^′..q^′], p^′ > p, computing all DP matrices columns requires to compute the last p^′ − p columns. If p^′ − p < m^′ (recall that m^′ = q − p = q^′−p^′), this means that the two window substrings do not overlap and therefore no DP matrix column can be reused.

Our improved algorithm, hereinafter called LScanAlign, in the worst case needs to process every RSSP region for every window shift. Hence, it has the same time complexity as algorithm ScanAlign. However, as in many cases only a few RSSP regions are evaluated, it is much faster in practice as will be shown later. ScanAlign and LScanAlign are the basis for further improvements presented in the subsequent sections.

Index-based search: LESAAlign

Suffix trees and enhanced suffix arrays are powerful data structures for exact string matching and for solving other string processing problems [29, 30]. In the following we show how the use of enhanced suffix arrays leads to even faster algorithms for searching for matches of an RSSP $Q$ in an RNA sequence S.

The enhanced suffix array of a sequence S is composed of the suffix array suf and the longest common prefix array lcp. Let $, called terminator symbol, be a symbol not in $A$ for marking the end of a sequence. $ is larger than all the elements in $A$ . suf is an array of integers in the range 1 to n + 1 specifying the lexicographic order of the n + 1 suffixes of the string S$. That is, S _suf_[1],S _suf_[2],...,S _suf_[n+1] is the sequence of suffixes of S in ascending lexicographic order. Table suf requires 4n bytes and can be constructed in O(n) time and space [31]. In practice non-linear time construction algorithms [32, 33] are often used as they are faster. lcp is a table in the range 1 to n + 1 such that lcp[1] = 0, and lcp[i] is the length of the longest common prefix between S _suf_[i−1] and S _suf_[i] for 1 < i ≤ n + 1. Table lcp requires n bytes and stores entries with value up to 255, whereas occasional larger entries are stored in an exception table using 8 bytes per entry [30]. More space efficient representations of the lcp table are possible (see [34]). The construction of table lcp can be accomplished in O(n) time and space given suf[35]. For an example of an enhanced suffix array, see Figure 4. In the following we assume that the enhanced suffix array of S has already been computed.

Consider an RSSP $Q$ to be matched against an RNA sequence S with up to d indels. For each i, 1 ≤ i ≤ n, let p_i = min{m + d,|S _suf_[i]|} be the reading depth of suffix S _suf_[i]. When searching for matches of $Q$ in S, we observe that algorithms ScanAlign and LScanAlign scan S computing $dist (Q, S [p..q])$ for every window substring of length q − p + 1 = m + d. In the suffix array, each substring S[p..q] is represented by a suffix S _suf_[i] up to reading depth p_i, i.e. there is a substring S _suf_[i][1..p_i] such that S _suf_[i][1..p_i] = S[p..q]. To match $Q$ in S using a suffix array, we simulate a depth first traversal of the lcp interval tree [30] of S on the enhanced suffix array of S such that the reading depth of each suffix is limited by p_i. That is, we traverse the suffix array of S top down, computing the sequence-structure edit distance $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}])$ for each suffix S _suf_[i]. We recall that candidate matches of $Q$ have length between m − d and m + d and that p_i ≤ m + d. In case p_i < m − d, we can skip S _suf_[i]. Also, remember that all candidate matches shorter than p_i are obtained as a by-product of the computation of $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}])$ . Hence, for every p^′, m − d ≤ p^′ ≤ p_i, if $dist (Q, {S_{suf}}_{[i]} [1 .. p^{'}]) \leq K$ we report [suf[i]..suf[i] + p^′] as a matching interval of $Q$ in S. That is, $Q$ matches substring S[suf[i]..suf[i] + p^′] beginning at position suf[i] of S.

Our algorithm for the suffix array traversal and $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}])$ computation, hereinafter called LESAAlign, builds on algorithms ScanAlign and LScanAlign. ScanAlign and LScanAlign exploit overlapping substrings of consecutive window substrings to avoid recomputation of DP matrices entries. LESAAlign exploits the enhanced suffix array in two different ways. First, for a single suffix S _suf_[i], i > 0, it benefits from the common prefix of length lcp[i] between two consecutive suffixes S _suf_[i] and S _suf_[i−1] by avoiding the recomputation of columns j, 1 ≤ j ≤ lcp[i] − k + 1, of each matrix D P_k. This means that, for lcp = min{p_i,lcp[i]}, it avoids the recomputation of $\sum_{k = 1}^{lcp} lcp - k + 1$ columns for S _suf_[i]. See an example in Figure 5. We observe that if p_i ≤ lcp, no DP entry needs to be recomputed. In this case, two situations arise:

1.
If p _i ≤ lcp and $dist (Q, {S_{suf}}_{[i - 1]} [1 .. p_{i - 1}]) \leq K$ , then clearly $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}]) \leq K$ and at least one match of $Q$ starts at position suf[i] of S; and
2.
If p _i ≤ lcp and $dist (Q, {S_{suf}}_{[i - 1]} [1 .. p_{i - 1}]) > K$ , then $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}]) > K$ .

These situations allow LESAAlign to benefit from the enhanced suffix array in a second important way. That is, it skips all suffixes S _suf_[i], S _suf_[i+1],..., S _suf_[j] sharing a common prefix of at least length lcp with S _suf_[i−1]. To find the index j of the last suffix S _suf_[j] to be skipped, it suffices to look for the largest j such that min{lcp[i],lcp[i+1],...,lcp[j]} ≥ lcp. If the first situation above holds, there are matches of $Q$ in S at positions suf[i], suf[i +1 ],..., suf[j]. We note that suffixes can also be efficiently skipped using so-called skip-tables as described in [36]. However, to save the 4n additional bytes required to store such tables we do not use them here. Our algorithm continues the top-down traversal of the suffix array with suffix S _suf_[j+1], taking into account that the DP tables were last computed for S _suf_[i−1]. Consequently, the length of the longest common prefix between S _suf_[i−1] and S _suf_[j+1] to be considered in the processing of S _suf_[j+1] is min{lcp[i],lcp[i+1],...,lcp[j],lcp[j+1]}.

We also incorporate in our index-based algorithm the early-stop alignment computation scheme of algorithm LScanAlign. This allows to skip suffixes S _suf[i] as soon as it becomes clear that the sequence-structure edit distance of RSSP $Q$ and S _suf_[i] up to reading depth p_i will exceed the cost threshold $K$ . For this, LESAAlign progressively aligns regions of $Q$ to a substring of the current suffix as in algorithm LScanAlign, checking whether the cost of each subalignment remains below the cost threshold $K$ , thus applying Lemma 2. If the cost exceeds $K$ , the alignment computation of the remaining pattern regions is skipped and the algorithm proceeds with processing the next suffix. To avoid recomputing as many entries of the DP matrices as possible while traversing the suffix array, LESAAlign differs from LScanAlign in the way it manages (non-) aligned regions for each suffix. Lemma 1, which algorithm LScanAlign applies to support early-stop computation, relies on scanning the searched RNA sequence S and overlapping window substrings. This makes it unsuitable for use with the suffix array. Instead, LESAAlign only uses information from the lcp table as follows. Let z be the number of regions of $Q$ indexed from 1 to z and T = S _suf_[i][1..p_i] be the current substring. When progressively aligning the regions of $Q$ to a substring of T, we store the index r of the first region whose alignment cost exceeds $K$ , if there is any. That is, for the first region $Q [x..y]$ whose index r we store, it holds that for every d^′, −d ≤ d^′ ≤ min{d,x}, $dist (Q [x..y], T_{x + d^{'}} [1 ..l + z]) > k$ with l = y − x + 1, z ∈ {|d^′|−d,−|d^′|+d}, and y+d^′ ≤ m + d (see Lemma 2). Then, when aligning $Q$ to a subsequent substring S _suf_[j][1..p_j], we must distinguish the regions of $Q$ previously computed from regions not computed.

Previously computed pattern regions are all regions whose index is strictly smaller than r. The alignment computation of these regions profits from the common prefix between S_suf_[i][1..p_i] and S_suf_{[ j]}[1..p_j] by avoiding the recomputation of DP matrices columns as described above.
Non-computed pattern regions are all regions whose index is larger than or equal to r. In this case, all DP matrices columns of the respective pattern region need to be computed, even if S_suf_[i][1..p_i] and S_suf_{[ j]}[1..p_j] share a common prefix.

We observe that longer ranges of suffixes not containing matches to $Q$ can be skipped thanks to the early-stop alignment computation scheme. Note that the left-most character of T needed to assert $dist (Q [x..y], T_{x + d^{'}} [1 ..l + z]) > K$ is T[x + l + d − 1] = T[x + y − x + 1 + d − 1] = T[y + d] as l = y − x + 1. Therefore, no suffix sharing prefix T[1..y + d] can match $Q$ and thus can be skipped in the top-down traversal of the suffix array of S. Because in most cases y + d < p_i, more suffixes are likely to share a prefix of length y + d than of length p_i with S _suf_[i]. For the pseudocode of algorithm LESAAlign, see Section 1 of Additional file 1.

Enhanced index-based search: LGSlinkAlign

Given an RSSP $Q$ to be searched in an RNA sequence S, algorithm LESAAlignis very fast when it can

avoid recomputation of DP matrices columns due to a common prefix between suffixes of S; and
skip long ranges of suffixes of the suffix array suf whose common prefix up to a required reading depth are known to match or not match $Q$ .

Therefore, LESAAlign exploits repetitions of substrings of S, i.e. substrings shared by different suffixes, and information of the lcp table to save computation time. However, the use of information of the lcp table alone does not necessarily lead to large speedups. Consider e.g. the DP matrices for the computation of the alignment of $Q = (AAGUUUC, ..(...))$ and substring S _suf_[4][1..p₄] = ACCCUCUU in Figure 5. The enhanced suffix array of S is shown in Figure 4. The substring S _suf_[4][1..p₄] of length 8 shares a common prefix of length lcp[4] = 4 with the previously processed substring S _suf_[3][1..p₃]. Despite this common prefix, still 182/252 ≈ 72% of the DP matrices entries need to be computed (disregarding initialization rows and columns 0) in case no early-stop is possible, i.e. in case $K > 4$ . This is more than the at most 56/252 ≈ 22% of the DP matrices entries computed by the online algorithm LScanAlign for a window shift.

Our next goal is to develop an algorithm traversing the enhanced suffix array of S that:

1.
can skip more suffixes; and
2.
improves the use of already computed DP matrices entries, reusing computed entries for as many suffixes as possible.

To address the first goal, we motivate our method by recalling the alignment computation example in Figure 2. In this example, one of the regions of $Q = (AAGUUUC, ..(…))$ is $Q [3 .. 7] = (GUUUC, (…))$ . Assume $K = d = 1$ and observe that $dist (Q [3 .. 7], T_{3 + d^{'}} [1 .. 5 + z]) > 1$ for every d^′, −1 ≤ d^′ ≤ 1, z ∈ {|d^′|−1,−|d^′| + 1}, i.e. the alignment cost for this pattern region already exceeds the cost threshold of 1 (in accordance with Lemma 2). In other words, $Q [3 .. 7]$ cannot align to any of the substrings T[2..6] = CCCUC, T[3..6] = CCUC, T[3..7] = CCUCU, T[3..8] = CCUCUU, or T[4..8] = CUCUU with a cost lower than 1. Observe further that the alignment computation of region $Q [3 .. 7]$ does not depend on any previous computation of any other region. We can therefore conclude that no suffix containing substring T[2..8]=CCCUCUU from position 2 to 8 can match $Q$ , independently of any prefix of length 1. Our goal is to find and eliminate from the search space all such suffixes, in addition to skipping all suffixes sharing prefix T[1..8] as performed by LESAAlign. That is, we want to skip suffixes sharing a substring, not limited to a prefix, whose alignment cost to a pattern region exceeds cost threshold $K$ .

Let S be an arbitrary RNA sequence and T[x..y] = S _suf_[i][x..y] contain all substrings whose alignment cost to a region of an RSSP $Q$ exceeds threshold $K$ . Consider the following two cases for skipping suffixes that cannot match $Q$ as a consequence of containing substring T[x..y] from position x to y. (1) For any value of x, all suffixes sharing prefix T[1..y] can be skipped as performed by algorithm LESAAlign. (2) Now let x > 1. To find all suffixes of S sharing substring T[x..y] from position x to y, we first locate all suffixes sharing T[x..y] as a prefix. We begin by locating one such suffix, in particular the suffix of index suf[j] that contains all but the first x^′ = x − 1 characters of S _suf_[i], i.e. suffix ${S_{suf}}_{[j]} = {S_{suf}}_{[i] + x^{'}}$ . We determine j using a generalization of a concept originated from suffix trees. It is a property of suffix trees that for any internal node spelling out string T there is also an internal node spelling out T₂ whenever |T| > 1 [37]. A pointer from the former to the latter node is called a suffix link. In the case of suffix arrays, a suffix link can be computed using the inverse suffix array suf⁻¹ of S$. suf⁻¹ is a table in the range 1 to n + 1 such that suf⁻¹[suf[i]] = i. It requires 4n bytes and can be computed via a single scan of suf in O(n) time. Given table suf⁻¹, we can define the suffix link from T = S _suf_[i] to T₂ = S _suf_[i]+1 as link = suf⁻¹[suf[i] + 1], i.e. it holds that suf[link] = suf[i] + 1. Now, if x^′ = 1, we already find that the index suf[j] of the suffix containing all but the first character of S _suf_[i] is suf[j] = suf[link] because ${S_{suf}}_{[link]} = {S_{suf}}_{[i] + x^{'}}$ holds. However, we also want to be able to determine j for any x^′ ≥ 1. The obvious solution is to compute suffix links x^′ successive times. Each suffix link skips the first character of the previously located suffix. For a more efficient solution, we generalize suffix links to point directly to the suffix without a prefix of any length x^′ of the initial suffix. For this purpose we define a function $link : N \times N \to N$ as:

link (i, x^{'}) = {suf}^{- 1} [suf [i] + x^{'}] .

(10)

Then, by letting j = link(i,x^′), ${S_{suf}}_{[link (i, x^{'})]} = {S_{suf}}_{[i] + x^{'}}$ holds for any x^′ ≥ 1. All suffixes sharing T[x..y] as a prefix are all suffixes in the range j_start to j_end where j_start is the smallest and j_end is the largest index satisfying min{lcp[j_start + 1],...,lcp[j],...,lcp[j_end]} ≥ y − x + 1. Finally, we find that all suffixes of S sharing substring T[x..y] from position x to y are all ${S_{suf}}_{[j^{'}] - x^{'}}$ , $j_{start} \leq j^{'} \leq j_{end}$ , satisfying $suf [j^{'}] > x^{'}$ . To skip these suffixes not containing matches to $Q$ in the top-down traversal of the suffix array suf, we mark their positions as true (for already“processed”) in a bit array vtab of n bits. The suffix array traversal proceeds from position suf[i], but skips the marked suffixes when their positions are reached.

We remark that the described method for skipping suffixes can profit from a resorting according to the order by which RSSP regions are aligned. In the alignment computation example in Figure 2, determining $dist (Q [3 .. 4], T_{3 + d^{'}} [1 .. 2 + z]) > 1$ , −1≤d^′ ≤ 1, z ∈ {|d^′|−1,−|d^′| + 1}, does not depend on character T[1] and region $Q [1 .. 1]$ . Hence, region $Q [1 .. 1]$ is unnecessarily aligned first when the regions are sorted by a top-down analysis of the DP tables. To decrease the chance that unnecessary computations occur, we sort the RSSP regions to begin aligning with the left-most RSSP region $Q [x..y]$ not depending on the alignment of any other region and satisfying x − d > 1.

We now address the second goal, namely reusing computed DP matrices entries for as many suffixes as possible. Recall that computing the sequence-structure edit distance $dist (Q, {S_{suf}}_{[i]} [1 .. p_{i}])$ for each suffix S _suf_[i] up to reading depth p_i means computing p_i + 1DP matrices, one for each suffix T_k of string T = S _suf_[i][1..p_i], 1 ≤ k ≤ m^′, and one for the empty sequence ε. Observe that each suffix T_k, T_k ≠ T, also occurs itself as a prefix of a suffix in table suf, i.e. there exists a suffix S _suf_[j] shorter than S _suf_[i] by exactly k − 1 characters which has prefix T_k. Consequently, T_k is processed again in an alignment to RSSP $Q$ at a different point in time during the traversal of suf. Let T^′ = S _suf_[j][1..p_j]. Now note that if T^′ is at a (nearly) contiguous position in suf to T, T^′ and T are likely to share a common prefix due to their similar lexicographic ranking. This allows algorithm LESAAlign to avoid recomputation of DP matrices columns by using information from the lcp table. Unfortunately, T^′ and T can be lexicographically ranked far away from each other in table suf, meaning that the DP matrices computed for T^′either:

were already computed once because T^′ is lexicographically smaller than T, but were discarded to allow the processing of other suffixes until T was traversed; or
are computed for the first time otherwise, but will not be reused to also allow the processing of other suffixes until T^′ occurs in table suf as a prefix of a suffix itself.

In both cases, redundant computations occur. To avoid this, we optimize the use of computed DP matrices by processing T^′ directly after processing T for fixed k = 2, recalling that T = S _suf_[i][1..p_i] and T^′ = S _suf_[j][1..p_j]. This value of k implies that S _suf_[j] does not contain the first character of S _suf_[i] and that we can locate S _suf_[j] in table suf by computing the suffix link j = link(i,1). Also, k = 2 implies that T^′ only differs by its last character from T, aside from not beginning with character T[1]. Therefore, to determine $dist (Q, T^{'})$ , we only have to compute the last column of the DP matrices required to compute $dist (Q, T)$ as shown by Lemma 1. We note that, because i and j are not necessarily contiguous positions in suf, we mark the processed suffix S _suf_[j] in the bit array vtab so that it is only processed once. If no match to RSSP $Q$ begins at position suf[j], we also mark and skip every suffix sharing the substring with T^′ whose alignment to a region of $Q$ is known to exceed threshold $K$ . Once T^′ is processed and all possible suffixes are skipped, we recursively repeat this optimization scheme by setting T = T^′ and processing the next $T^{'} = {S_{suf}}_{[j^{'}]} [1 .. p_{j^{'}}]$ where j^′ = link(j,1). The recursion stops when $p_{j^{'}} < m - d$ , meaning that T^′ is too short to match $Q$ , or when suf[j^′] is already marked as processed in vtab. The suffix array traversal proceeds at position i + 1 repeating the entire scheme.

We call our algorithm incorporating the presented improvements LGSlinkAlign. For its pseudocode, see Section 1 of Additional file 1. LGSlinkAligninherits all the improvements of the above presented algorithms. In summary, its improvements are as follows.

LGSlinkAlign traverses the enhanced suffix array of the searched sequence S, i.e. the suffix array suf enhanced with tables lcp and suf⁻¹. During this traversal, it benefits from common prefixes shared among suffixes to (1) avoid the computation of DP matrix columns and to (2) skip ranges of suffixes known to match or not match RSSP $Q$ as in algorithm LESAAlign.
The suffix array traversal is predominantly top down, but non-contiguous suffixes are processed to optimize the use of computed DP matrices.
LGSlinkAlign stops the alignment computation as early as the alignment cost of a region of RSSP $Q$ and a substring of the prefix of the current suffix exceeds threshold $K$ , an improvement first introduced in algorithm LScanAlign.
Due to the early-stop computation scheme, suffixes sharing common prefixes shorter than m + d can be skipped, leading to larger ranges of skipped suffixes. The early-stop computation scheme also helps to identify and skip non-contiguous suffixes sharing a common substring which is not their prefix.

Example: searching for an RSSP with algorithm LGSlinkAlign

We elucidate the ideas of algorithm LGSlinkAlign with the following example. Consider the RSSP $Q = (AAGUUUC, ..(...))$ to be matched in the sequence S whose enhanced suffix array is shown in Figure 4. To keep the example simple, we only allow a small cost threshold and number of indels, i.e. we set $K = d = 1$ . The costs of the edit operations are ω_d = ω_m = ω_b = ω_a = 1 and ω_r = 2. When traversing the enhanced suffix array of S, LGSlinkAlign always begins to align $Q$ to a substring of S with region $Q [4 .. 6]$ , because the alignment computation of this region does not depend on any other region. In addition, the left index of this region satisfies 4 − d > 1. This means that the alignment computation of region $Q [1 .. 2]$ is avoided if the cost of aligning region $Q [4 .. 6]$ exceeds the threshold $K$ . The algorithm starts the traversal of the enhanced suffix array of S aligning $Q [4 .. 6]$ to substrings of T = S _suf_[1][1..p₁] = S₁₄[1..8] from positions 4 − d = 3 and 6 + d = 7. For this, it computes $dist (Q [4 .. 6], T_{4 + d^{'}} [1 .. 3 + z])$ for −1 ≤ d^′ ≤ 1 and z ∈ {|d^′|−1,−|d^′|+1}. Observe that $dist (Q [4 .. 5], T_{4 + d^{'}} [1 .. 2 + z]) > 1$ holds. Hence (1) no suffix with prefix T[1..6] = AACACC can match $Q$ and thus can be skipped and (2) no suffix containing substring T[3..6] = CACC from position 4−d = 3 to 5 + d = 6 can match $Q$ and thus can be skipped as well. We notice that there is no other suffix with prefix AACACC because lcp[2] < 6, so we analyze case (2). The algorithm looks for suffixes sharing substring CACC from position 3 to 6. It begins by locating suffixes without the first two characters of T and containing CACC as a prefix. It follows the suffix link link(1,2) = suf⁻¹[suf[1]+2] = suf⁻¹[16] = 7 and looks for the smallest j_start and largest j_end satisfying min{lcp[j_start + 1],...,lcp[8],...,lcp[j_end]} ≥ 4 = |CACC|. It finds that j_start = 5 and j_end = 8, since min{lcp[5+1],lcp[7],lcp[8]} = min{4,5,5} ≥ 4 holds. The suffixes containing CACC from position 3 to 6 are S _suf_[5]−2 = S₁₁, S _suf_[6]−2 = S₇, and S _suf_[8]−2 = S₁₄. S₁₁and S₇ are marked in the bit array vtab, whereas S₁₄ = S _suf_[1] was already processed and does not need to be marked. We observe that S _suf_[7]−2 = S₋₁ is not a valid suffix. To reuse as many computed DP matrices entries as possible, the algorithm next processes the suffix S _suf_[j] which does not contain the first character of S _suf_[1]. It determines j = link(1,1)=suf⁻¹[suf[1]+1] = 11 and sets T = S _suf_[12][1..p₁₂] = S₁₅[1..8]. The alignment to this substring T begins with its substrings from positions 3 to 7 and $Q [4 .. 6]$ . We observe that $dist (Q [4 .. 5], T_{4 + d^{'}} [1 .. 2 + z]) > 1$ holds and consequently T cannot match $Q$ . Because suffix S _suf_[12] = S₁₅ was traversed via a suffix link, it is marked as processed in vtab. We now again analyze two cases of suffixes that cannot match $Q$ and therefore can be skipped: (1) suffixes sharing prefix T[1..6] = CCACCC and (2) suffixes containing substring T[3..6] = ACCC from position 3 to 6. Satisfying case (1) are suffixes S _suf_[11] = S₁ and S _suf_[10] = S₈ since lcp[12] ≥ 6 and lcp[11] ≥ 6. These suffixes are marked in vtab. We now check if there are suffixes satisfying case (2). The algorithm begins by locating suffixes containing substring T[3..6] = ACCC as a prefix. For this, it follows the suffix link link(12,2) = suf⁻¹[suf[12] + 2] = 4 and determines j_start = 2 and j_end = 4. The property min{lcp[2 + 1],lcp[4]} ≥ 4 is satisfied. The suffixes containing ACCC from position 3 to 6 are S _suf_[2]−2 = S₈, S _suf_[3]−2 = S₁, and S _suf_[4]−2 = S₁₅. Since these were already marked in vtab, none of them needs to be marked. The algorithmic scheme of LGSlinkAlign to reuse as many computed DP matrices entries as possible continues processing other suffixes which are located by iteratively following the suffix links. It locates suffixes S _suf_[8], S _suf_[4], S _suf_[18], and S _suf_[19] because link(12,1) = 8, link(8,1) = 4, link(4,1) = 18, and link(18,1) = 19, respectively. These suffixes are processed analogously as above, one after the other, not resulting in matches to $Q$ . The iteration then leads to suffix S _suf_[20], since link(19,1) = 20. However, |S _suf_[20]| < m−d, meaning that this suffix is too short to contain a match to $Q$ . This causes the iteration to stop. The suffix array traversal proceeds and repeats the entire matching scheme from the suffix that follows the last processed suffix not located via a suffix link, i.e. suffix S _suf_[2]. After processing and skipping all possible suffixes, we note that LGSlinkAlign does not report any matches for the defined cost threshold and allowed number of indels $K = d = 1$ . By setting $K = 5$ , it reports a match at position 16.

RNA secondary structure descriptors based on multiple ordered RSSPs

RNAs with complex branching structures often cannot be adequately described by a single RSSP due to difficulties in balancing sensitivity, specificity, and reasonable running time of the used search algorithm. Although their description by a single short RSSP specifying an unbranched fragment of the molecule might be very sensitive, it is often too unspecific and likely to generate many spurious matches when searching for structural homologs in large sequence databases or complete genomes. In contrast, using a single long RSSP often requires a higher cost threshold $K$ for being sensitive enough which in turn, together with the increased RSSP length, has a negative influence on the search time. This might lead to disadvantageous running times in larger search scenarios in practice.

We solve this problem by applying the powerful concept of RNA secondary structure descriptors (SSDs for short) recently introduced in [23]. The underlying concept of SSDs is similar to the idea of PSSM family models [38], which are successfully used for fast and sensitive protein homology search. SSDs use the information of multiple ordered RSSPs derived from the decomposition of an RNA’s secondary structure into stem-loop like structural elements. In a first step, approximate matches to the single RSSPs the SSD consists of are obtained using one of the algorithms presented above. From these matches, either local or global high-scoring chains are computed with the efficient chaining algorithms described in [23]. These algorithms take the chain’s score, i.e. the weights of the fragments in the chain, into account (see [23] for details). For chaining of approximate RSSP matches, we use the fragment weight $ω_{Q}^{*} - dist (Q, T)$ for an RSSP $Q$ of length m matching substring T, where $ω_{Q}^{*} = m * ω_{m} + bps * ω_{r}$ and bps denotes the number of base pairs in $Q$ . Here $ω_{Q}^{*}$ is the maximal possible weighting $Q$ can gain when being aligned and therefore it reflects the situation of a perfect match between $Q$ and T. With this definition of a fragment’s weight, a positive weight is always guaranteed, thus satisfying a requirement for the chaining algorithm. Once the chaining of matches to the RSSPs is completed, the high-scoring chains are reported in descending order of their chain score. By restricting to high-scoring chains, spurious RSSP matches are effectively eliminated. Moreover, the relatively short RSSPs used in an SSD can be matched efficiently with the presented algorithms leading to short running times that even allow for the large scale application of approximate RSSP search.

Results and discussion

Implementation and computational results

We implemented (1) the fast index-based algorithms LESAAlign and LGSlinkAlign, (2) the online algorithms LScanAlign, ScanAlign, both operating on the plain sequence, and (3) the efficient global and local chaining algorithms described in [23]. In our experiments we use ScanAlign, which is the scanning version of the method proposed in [25], for reference benchmarking. All algorithms are included in the program RaligNAtor. The algorithms for index construction were implemented in the program sufconstruct, which makes use of routines from the libdivsufsort2 library (see http://code.google.com/p/libdivsufsort/) for computing the suf table in O(n log n) time. For the construction of table lcp we employ our own implementation of the linear time algorithm of [35]. All programs were written in C and compiled with the GNU C compiler (version 4.5.0, optimization option -O3). All measurements are performed on a Quad Core Xeon E5620 CPU running at 2.4 GHz, with 64 GB main memory (using only one CPU core). To minimize the influence of disk subsystem performance, the reported running times are user times averaged over 10 runs. Allowed base pairs are canonical Watson-Crick and wobble, unless stated otherwise. The used sequence-structure operation costs are ω_d = ω_m = ω_b = ω_a = 1 and ω_r = 2.

Comparison of running times

In a first benchmark experiment we measure the running times needed by the four algorithms to search with a single RSSP under different cost thresholds $K$ and number of allowed indels d. We set (1) $K = d$ varying the values in the interval [0,6], (2) $K = 6$ varying d in the interval [0,6], and (3) d = 0 varying $K$ in the interval [0,6]. The searched dataset contains 2,756,313 sequences with a total length of ≈786 MB from the full alignments of all Rfam release 10.1 families. The construction of all necessary index tables needed for LESAAlign and LGSlinkAlign with sufconstruct and their storage on disk required 372 seconds. In the following we refer to this dataset as RFAM10.1 for short. In this experiment we use the RSSP tRNA-pat of length m = 74 shown in Figure 6 describing the consensus secondary structure of the tRNA family (Acc.: RF00005). The results of this experiment are presented in Figure 7 and Table S4, S5, and S6 of Additional file 1. LGSlinkAlign and LESAAlign are the fastest algorithms. LGSlinkAlign is faster in particular for increasing values of $K$ and d, being only slower than LESAAlign for small values of $K$ and d and for fixed d = 0. The advantage of LGSlinkAlign over LESAAlign with higher values of $K$ and d is explained by the increased reading depth in the suffix array implicated by $K$ and d and the fewer suffixes sharing a common prefix that can be skipped. This holds for both LGSlinkAlign and LESAAlign, however LGSlinkAlign counterbalances this effect by reusing computed DP matrices for non-contiguous suffixes of the suffix array. In a comparison to the two online algorithms considering only approximate matching, i.e. $K \geq 1$ , the speedup factor of LGSlinkAlign over ScanAlign (LScanAlign) is in the range from 560 for $K = 1$ and d = 0 to 17 for $K = d = 6$ (from 15 for $K = 2$ and d = 0 to 3 for $K = d = 6$ ). LESAAlign achieves a speedup factor over ScanAlign (LScanAlign) in the range from 1,323 for $K = 1$ and d = 0 to 9 for $K = d = 6$ (29 for $K = 1$ and d = 0 to 1.6 for $K = d = 6$ ). In a comparison between the online algorithms, LScanAlign is faster than ScanAlign by up to factor 45 for $K \geq 1$ . In summary, all algorithms except ScanAlign profit from low values of $K$ and d reducing their search times. This is a consequence of the use of the early-stop alignment computation scheme. As shown in Figure 7(2), also the number of allowed indels d influences the search time. For an additional experiment investigating the influence of $K$ and d on the search time required by the four algorithms, see Section 2 of Additional file 1. A further experiment, described in Section 3 of Additional file 1, compares RaligNAtor and the widely used tool RNAMotif[15] in terms of sensitivity and specificity in searches for the tRNA-pat depicted in Figure 6.

Scaling behavior of the online and index-based algorithms

In a second experiment we investigate how the search time of algorithms ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign scales on random subsets of RFAM10.1 of increasing size. The searched RSSPs flg1, flg2, and flg3 were derived from the three stem-loop substructures the members of family flg-Rhizobiales RNA motif (Acc.: RF01736) [40] fold into. These patterns differ in length, cost threshold $K$ and number of allowed indels d; see Figure 8 for their definition, noting that $K$ and d are simply denoted cost and indels in the RaligNAtor RSSP syntax. The results are shown in Figure 9 and Table S7 of Additional file 1. LGSlinkAlign and LESAAlign show a sublinear scaling behavior, whereas LScanAlign and ScanAlign scale linearly. The fastest algorithm is LGSlinkAlign, requiring only 11.68 (53.08) minutes to search for all three patterns in the smallest (full) subset. The second fastest algorithm is LESAAlign, followed by LScanAlign and ScanAlign, which require 32.27 (126.97), 40.47 (321.01), and 98.35 (754.66) minutes, respectively, to search for all the patterns in the smallest (full) subset. This corresponds to a speedup of 8.4 to 14.2 of LGSlinkAlign over ScanAlign on the smallest and the full subsets. Comparing the search time for pattern flg3 individually, the speedup of LGSlinkAlign over ScanAlign ranges from 22.6 to 38.8. We also observe that ScanAlign requires the longest time to match the longest pattern flg2 of length m = 37. The other algorithms profit from the early-stop computation approach to reduce the search time for this pattern on every database subset.

Influence of stem and loop lengths on the search time

When searching a database for matches of a given pattern, our algorithms compute the required DP matrices using recurrences according to two main cases: either a row corresponds to an unpaired or to a paired base of the pattern. To analyze the influence of the used recurrence on the search time of each algorithm, we search RFAM10.1 for artificial stem-loop patterns. Therefore we vary the number of bases in the loop of pattern $Q = (NNNACANNN, (((...))))$ from 3 to 12 by using As and Cs. Additionally, we vary the number of base pairs in the stem of pattern $Q = (NNACANN, ((...)))$ from 2 to 11 by pairs of Ns. Matching the patterns in these two experiments means to increase the use of the DP recurrences in Equations (7) and (8), respectively. The cost threshold and the number of allowed indels are fixed at $K = d = 3$ . Allowed base pairs are (A, U), (U, A), (C, G), and (G, C). The results are shown in Figure 10. We observe that increasing the number of bases in the loop has little influence and even reduces the running time of the two fastest algorithms LGSlinkAlign and LESAAlign. This can be explained by the use of the early-stop alignment computation scheme in these algorithms. The reduction of the running time is explained by the fewer matches that need to be processed as the pattern gets longer and more specific. For an increasing number of base pairs in the stem, LGSlinkAlign is the least affected algorithm. We also observe that the linear increase in running time of the basic online algorithm ScanAlign, caused by an extension of the pattern by one base pair, is similar to the effect of adding two bases in the loop.

RNA family classification by global chaining of RSSP matches

In the next experiment we show the effectiveness of global chaining when searching with two SSDs built for Rfam families Cripavirus internal ribosome entry site (Acc.: RF00458) and flg-Rhizobiales RNA motif (Acc.: RF01736) [40]. These two families present only 53% and 69% sequence identity, respectively, much below the average of ∼80% of the Rfam 10.1 families. This illustrates the importance of using both sequence and structure information encoded in the SSDs of this experiment. The SSD of family RF01736 comprises three RSSPs, denoted by flg1, flg2, and flg3 in Figure 8, derived from the three stem-loop substructures the members of this family fold into. The SSD of family RF00458 comprises five RSSPs, denoted by ires1, ires2, ires3, ires4, and ires5 in Figure S5 of Additional file 1, where the last four RSSPs describe the stem-loop substructures the members of this family fold into. ires1 describes a moderately conserved strand occurring in these members. Observe also in Figures 8 and S5 the cost threshold $K$ and allowed number of indels d used per pattern, remembering that these are denoted cost and indels in the RaligNAtor RSSP syntax.

Searching with the SSD of family RF00458 in RFAM10.1 delivers 16,033,351 matches for ires1, 8,950,417 for ires2, 1,052 for ires3, 112 for ires4, and 1,222,639 for ires5. From these matches, RaligNAtor computes high-scoring chains of matches, eliminating spurious matches and resulting in exactly 17 chains. Each chain occurs in one of the 16 sequence members of the family in the full alignment except in sequence AF014388, where two chains with equal score occur. The highest (lowest) chain score is 171 (162). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all five RSSPs requires 688.32, 585.59, 186.88, and 92.25 minutes, respectively, whereas chaining requires 13.66 seconds. See Table S8 of Additional file 1 for the time required to match each pattern using the different algorithms.

The same search is performed using the SSD of family RF01736. It results in 4,145 matches for flg1, 68,024 for flg2, and 67 for flg3. Chaining the matches leads to 15 chains occurring each in one of the 15 sequence members of the family in the full alignment. The highest (lowest) chain score is 163 (156). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all three RSSPs requires 755.48, 336.69, 133.58, and 52.86 minutes, respectively, whereas chaining requires 0.03 seconds. The time required to match each pattern using each algorithm is reported in Table S9 of Additional file 1.

We also show that the lack of the sequence-structure edit operations supported by RaligNAtor deteriorates sensitivity and specificity in the search for sequence members of families RF00458 and RF01736. For this, we report in Section 4 and Table S10 of Additional file 1 results obtained with the Structator tool [23]. Structator is much faster but, in contrast to RaligNAtor, does not support all sequence-structure edit operations.

Importance of structural constraints for RNA family classification

To assess the potential of using RSSPs for reliable RNA homology search on a broader scale and to investigate the effect of using base pairing information, we evaluated RaligNAtor on 35 RNA families taken from Rfam 10.1 with different degrees of sequence identity and of different sizes. See Table 1 for more information about the selected families. In our experiment, we compared (1) RaligNAtor results obtained by using RSSPs derived from Rfam seed alignments with (2) results obtained for the same RSSPs ignoring base pairing information and (3) results obtained by blastn[41] searches with the families’ consensus sequence. For each selected family, we automatically compiled an RSSP $Q = (P, R)$ from the family’s seed alignment using the following procedure: at each position of the RSSP’s sequence pattern P, we choose the IUPAC wildcard matching all symbols in the corresponding alignment column. As structure string R, we use the secondary structure consensus available in the Rfam seed alignment. From the resulting RSSPs we remove the maximum prefix and suffix containing neither sequence information (i.e. IUPAC symbol N) nor base pairing information. To obtain a query sequence for blastn, we compute the consensus sequence from the family’s seed alignment. Because blastn does not appropriately handle IUPAC wildcard characters in the query, we choose the most frequent symbol occurring in a column as representative symbol in the consensus sequence. For the RaligNAtor searches, we adjust the cost threshold $K$ and number of allowed indels d such that we match the complete family. That is, we achieve a sensitivity of 100%. The used operation costs are ω_d = ω_m = 1, ω_b = ω_a = 2, and ω_r = 3. For the Blast searches, we called blastn with parameters -m8 -b 250000 -v 250000 and a very relaxed E-value cutoff of 1000. From the two RaligNAtor and one blastn outputs we count the number of true positives (#TPs) and false positives (#FPs) and compute ROC curves on the basis of the RaligNAtor score $ω_{Q}^{*} - dist (Q, T)$ and the blastn bit score. See Table 1 and Figure 11 for the results of this experiment. A ROC curve with values averaged over all families is shown in Figure 11(1).

Table 1 Results of RaligNAtor and blastn database searches for members of RNA families of different degrees of sequence identity in RFAM10.1

Full size table

In addition, we show in Figures 11(2) and (3) the results of the ROC analysis for the families with the lowest and highest degree of sequence identity. For the ROC curve of each selected family, see Figures S7 and S8 of Additional file 1. Clearly, by using base pairing information, RaligNAtor achieves a higher sensitivity with a reduced false positive rate compared to searches ignoring base pairing (compare columns “RaligNAtor” and “RaligNAtor (sequence only)” in Table 1). This is in particular evident when searching for families with a low degree of sequence identity. This can be explained by the small amount of information left in the RSSP for such a family, once the structural information is removed. Due to the high variability of bases in the columns of the multiple alignment of the family, the pattern contains a large number of wildcards. These symbols alone, without the constraints imposed by the base pairs, lead to unspecific patterns and therefore to a large number of false positives. We observe that, for families with sequence identity of up to 59%, the area under the curve (AUC) is considerably larger when base pairing information is taken into account. This difference decreases with increasing sequence identity (compare Figures 11(2) and (3)). Overall, the average AUC value over all families is, with a value of 0.93, still notably higher when base pairing information is considered compared to 0.89 if base pairing information is ignored (see Table 1). In this experiment, blastn only finds all members of those families whose sequence identity is at least 85%. This is due to the fact that blastn cannot appropriately handle IUPAC wildcard characters. Hence, by taking the most frequent symbol in an alignment column as consensus symbol, the heterogeneity of less conserved positions in the alignment cannot be adequately modeled. For the blastn searches, the average AUC value over all families is only 0.72.

RaligNAtorsoftware package

RaligNAtor is an open-source software package for fast approximate matching of RNA sequence-structure patterns (RSSPs). It allows the user to search target RNA or DNA sequences choosing one of the new online or further accelerated index-based algorithms presented in this work. The index of the sequence to be searched can be easily constructed with program sufconstruct distributed with RaligNAtor.

Searched RSSPs can describe any (branching, non-crossing) RNA secondary structure; see examples in Figures 1, 6, 8, and S5 of Additional file 1. Bases composing the sequence information of RSSPs can be ambiguous IUPAC characters. As part of the search parameters for RSSPs, the user can specify the cost of each sequence-structure edit operation defined above, the cost threshold of possible matches, and the number of allowed indels. The RSSPs, along with costs and thresholds per RSSP, are specified in a simple text file using a syntax that is expressive but easy to understand as shown in the mentioned figures. Another possibility is to provide the same costs and thresholds for all searched patterns as parameters in the command line call to RaligNAtor. To ensure maximal flexibility, the user can also define the base pairing rules from an arbitrary subset of $A \times A$ as valid pairings in a separate text file. Searches can be performed on the forward and reverse strands of the target sequence. Searching on the reverse strand is implemented by reversal of the RSSP and transformation according to Watson-Crick base pairing. Wobble pairs {(G,U), (U,G)} automatically become {(C,A), (A,C)}. Due to these transformations, the index is built for one strand only.

For describing a complex RNA with our concept of secondary structure descriptor (SSD), i.e. with multiple RSSPs, the user specifies all RSSPs in one text file. The order of the RSSPs in the file will then specify the order of the RSSP matches used to build high-scoring chains. The chain score directly depends on the score of each match occurring in the chain. This is inversely proportional to the sequence-structure edit distance of the RSSP and its matching substring in the target sequence. Hence, higher scores indicate sequences with a higher conservation which are probably more closely related to the sought RNA family.

Chaining of matches discards spurious matches not occurring in any chain. An additional filtering option eliminates matches overlapping another with a higher score for the same RSSP. This is particularly useful when indels lead to almost identical matches that are only shifted by a few positions in the target sequence.

The output of RaligNAtor includes not only matching positions to single RSSPs and chains, but their sequence-structure alignment to the matched substrings as well. In the RaligNAtor software package, all programs for searching patterns support multithreading to take advantage of computer systems with multiple CPU cores. There are two modes of parallelism. At first, different patterns are searched using multiple threads. Additionally, the search space (i.e. the sequence for the online algorithms and the index structure for the index-based methods) is partitioned, processing each part using a different thread. Lastly, we remark that our software also provides an implementation of the original algorithm of Jiang et al. for global sequence-structure alignment [25], easily applicable by the user.

Conclusions

We have presented new index-based and online algorithms for fast approximate matching of RNA sequence-structure patterns. Our algorithms, all implemented in the RaligNAtor software, stand out from previous search tools based on motif descriptors by supporting a full set of edit operations on single bases and base pairs. See Table 2 for an overview of the algorithms. In each algorithm, the application of a new computing scheme to optimally reuse the entries of the required dynamic programming matrices and an early-stop technique to avoid the alignment computation of non-matching substrings led to considerable speedups compared to the basic scanning algorithm ScanAlign. Our experiments show superior performance of the index-based algorithms LGSlinkAlign and LESAAlign, which employ the suffix array data structure and achieve running time sublinear in the length of the target database. When searching for approximate matches of biologically relevant patterns on the Rfam database, LGSlinkAlign (LESAAlign) was faster than ScanAlign and LScanAlign by a factor of up to 560 (1,323) and 17 (29), respectively (see Figure 7). Comparing the two index-based algorithms, LESAAlign was faster than LGSlinkAlign when searching with tight cost threshold (i.e. sequence-structure edit distance) and no allowed indels, but became considerably slower when the number of allowed indels was increased. In this scenario, LGSlinkAlign was faster than LESAAlign by up to 4 times. In regard to the two online algorithms, LScanAlign was faster than ScanAlign by up to factor 45. In summary, LGSlinkAlign is the best performing algorithm when searching with diverse thresholds, whereas LScanAlign is a very fast and space-efficient alternative. RaligNAtor also allows to use the powerful concept of RNA secondary descriptors [23], i.e. searching for multiple ordered sequence-structure patterns each describing a substructure of a larger RNA molecule. For this, RaligNAtor integrates fast global and local chaining algorithms. We further performed experiments using RaligNAtor to search for members of RNA families based on information from the consensus secondary structure. In these experiments, RaligNAtor showed a high degree of sensitivity and specificity. Compared to searching with primary sequence only, the use of secondary structure information considerably improved the search sensitivity and specificity, in particular for families with a characteristic secondary structure but low degree of sequence conservation. We remark that, up to now, RaligNAtor uses a relatively simple scoring scheme. By incorporating more fine grained scoring schemes like RIBOSUM [13] or energy based scoring [42], we believe that the performance of RaligNAtor for RNA homology search can be further improved. Beyond the algorithmic contributions, we provide with the RaligNAtor software distribution, a robust, well-documented, and easy-to-use software package implementing the ideas and algorithms presented in this manuscript.

Table 2 Overview of the presented algorithms

Full size table

Availability

The RaligNAtor software package including documentation is available in binary format for different operating systems and architectures and as source code under the GNU General Public License Version 3. See http://www.zbh.uni-hamburg.de/ralignator for details.

References

Mattick J: RNA regulation: a new genetics?. Nat Rev Genet. 2004, 5 (4): 316-323. 10.1038/nrg1321.
Article CAS PubMed Google Scholar
Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2012, 41 (D1):
Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics. 2005, 21 (16): 3352-3359. 10.1093/bioinformatics/bti550.
Article CAS PubMed Google Scholar
Höchsmann M, Voss B, Giegerich R: Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Bio Bioinformatics. 2004, 1: 53-62. 10.1109/TCBB.2004.11.
Article Google Scholar
Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problem. SIAM J Appl Mathe. 1985, 45: 810-825. 10.1137/0145048.
Article Google Scholar
Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007, 3 (4): e65+-
Article PubMed Central PubMed Google Scholar
Havgaard JH, Torarinsson E, Gorodkin J: Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol. 2007, 3 (10): e193+-
Article PubMed Central Google Scholar
Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007, 23 (8): 926-932. 10.1093/bioinformatics/btm049.
Article CAS PubMed Google Scholar
Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002, 317 (2): 191-203. 10.1006/jmbi.2001.5351.
Article CAS PubMed Google Scholar
Mathews DH: Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics. 2005, 21 (10): 2246-2253. 10.1093/bioinformatics/bti349.
Article CAS PubMed Google Scholar
Dalli D, Wilm A, Mainz I, Steger G: STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006, 22 (13): 1593-1599. 10.1093/bioinformatics/btl142.
Article CAS PubMed Google Scholar
Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009, 25 (10): 1335-1337. 10.1093/bioinformatics/btp157.
Article PubMed Central CAS PubMed Google Scholar
Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 44-10.1186/1471-2105-4-44.
Article PubMed Central PubMed Google Scholar
Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol. 2001, 313: 1003-11. 10.1006/jmbi.2001.5102.
Article CAS PubMed Google Scholar
Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif - A new RNA secondary structure definition and discovery algorithm. Nucleic Acids Res. 2001, 29 (22): 4724-4735. 10.1093/nar/29.22.4724.
Article PubMed Central CAS PubMed Google Scholar
Gautheret D, Major F, Cedergren R: Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comput Appl Biosci. 1990, 6 (4): 325-331.
CAS PubMed Google Scholar
RNABOB: a program to search for RNA secondary structure motifs in sequence databases. [http://selab.janelia.org/software.html],
Chang T, Huang H, Chuang T, Shien D, Horng J: RNAMST: efficient and flexible approach for identifying RNA structural homologs. Nucleic Acids Res. 2006, 34: W423-W428. 10.1093/nar/gkl231.
Article PubMed Central CAS PubMed Google Scholar
Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet. 1997, 13 (12): 497-498.
Article CAS PubMed Google Scholar
Grillo G, Licciulli F, Liuni S, SbisÃă E, Pesole G: PatSearch: A program for the detection of patterns and structural motifs in nucleotide sequences. Nucleic Acids Res. 2003, 31 (13): 3608-3612. 10.1093/nar/gkg548.
Article PubMed Central CAS PubMed Google Scholar
Billoud B, Kontic M, Viari A: Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database. Nucleic Acids Res. 1996, 24 (8): 1395-1403. 10.1093/nar/24.8.1395.
Article PubMed Central CAS PubMed Google Scholar
Reeder J, Giegerich R: A graphical programming system for molecular motif search. Proceedings of the 5th international Conference on Generative Programming and Component Engineering. 2006, New York: ACM Press, 131-140.
Google Scholar
Meyer F, Kurtz S, Backofen R, Will S, Beckstette M: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics. 2011, 12: 214-10.1186/1471-2105-12-214.
Article PubMed Central CAS PubMed Google Scholar
El-Mabrouk N, Raffinot M, Duchesne JE, Lajoie M, Luc N: Approximate matching of structured motifs in DNA sequences. J Bioinform Comput Biol. 2005, 3 (2): 317-342. 10.1142/S0219720005001065.
Article CAS PubMed Google Scholar
Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures. J Comput Biol. 2002, 9 (2): 371-388. 10.1089/10665270252935511.
Article CAS PubMed Google Scholar
Abouelhoda M, Ohlebusch E: Chaining algorithms for multiple genome comparison. J Discrete Algo. 2005, 3 (2-4): 321-341.
Article Google Scholar
Will S, Siebauer M, Heyne S, Engelhardt J, Stadler P, Reiche K, Backofen R: LocARNAscan: incorporating thermodynamic stability in sequence and structure-based RNA homology search. Algo Mol Biol. 2013, 8: 14-10.1186/1748-7188-8-14.
Article Google Scholar
Ukkonen E: Algorithms for approximate string matching. Inf Control. 1985, 64 (1-3): 100-118.
Article Google Scholar
Manber U, Myers E: Suffix arrays: a new method for on-line string searches. SIAM J Comput. 1993, 22 (5): 935-948. 10.1137/0222058.
Article Google Scholar
Abouelhoda M, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays. J Discrete Algo. 2004, 2: 53-86. 10.1016/S1570-8667(03)00065-0.
Article Google Scholar
Kärkkäinen J, Sanders P: Simple linear work suffix array construction. Proceedings of the 13th International Conference on Automata, Languages and Programming. 2003, Berlin - Heidelberg: Springer
Google Scholar
Puglisi SJ, Smyth W, Turpin A: The performance of linear time suffix sorting algorithms. DCC ’05: Proceedings of the Data Compression Conference. 2005, Washington: IEEE Computer Society, 358-367.
Chapter Google Scholar
Manzini G, Ferragina P: Engineering a lightweight suffix array construction algorithm. Algorithmica. 2004, 40: 33-50. 10.1007/s00453-004-1094-1.
Article Google Scholar
Fischer J: Wee LCP. Inf Proc Let. 2010, 110 (8-9): 317-320.
Article Google Scholar
Kasai T, Lee G, Arimura H, Arikawa S, Park K: Linear-time longest-common-prefix computation in suffix arrays and its applications. Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching. 2001, Berlin - Heidelberg: Springer, 181-192.
Chapter Google Scholar
Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006, 7: 389-10.1186/1471-2105-7-389.
Article PubMed Central PubMed Google Scholar
Ukkonen E: On-line construction of suffix trees. Algorithmica. 1995, 14 (3): 249-260. 10.1007/BF01206331.
Article Google Scholar
Beckstette M, Homann R, Giegerich R, Kurtz S: Significant speedup of database searches with HMMs by search space reduction with PSSM family models. Bioinformatics. 2009, 25 (24): 3251-3258. 10.1093/bioinformatics/btp593.
Article PubMed Central CAS PubMed Google Scholar
Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25 (15): 1974-1975. 10.1093/bioinformatics/btp250.
Article PubMed Central CAS PubMed Google Scholar
Weinberg Z, Wang J, Bogue J, Yang J, Corbino K, Moy R, Breaker R: Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biol. 2010, 11 (3): R31-10.1186/gb-2010-11-3-r31.
Article PubMed Central PubMed Google Scholar
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Article PubMed Central CAS PubMed Google Scholar
Mathews DH, Turner DH: Prediction of RNA secondary structure by free energy minimization. Curr Opin Struct Biol. 2006, 16 (3): 270-278. 10.1016/j.sbi.2006.05.010.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported by basic funding of the University of Hamburg. We thank Steffen Dettmann for interesting discussions and algorithmic ideas that contributed to this work.

Author information

Authors and Affiliations

Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, Hamburg, 20146, Germany
Fernando Meyer, Stefan Kurtz & Michael Beckstette

Authors

Fernando Meyer
View author publications
You can also search for this author in PubMed Google Scholar
Stefan Kurtz
View author publications
You can also search for this author in PubMed Google Scholar
Michael Beckstette
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Michael Beckstette.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

FM and MB developed the algorithms. FM implemented the algorithms. SK implemented the chaining algorithms. MB initiated the project and provided supervision and guidance. All three authors contributed to the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2013_6021_MOESM1_ESM.pdf

Additional file 1: Supplemental material. Additional file 1 contains additional experiments, figures, and tables. (PDF 308 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Meyer, F., Kurtz, S. & Beckstette, M. Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns. BMC Bioinformatics 14, 226 (2013). https://doi.org/10.1186/1471-2105-14-226

Download citation

Received: 27 February 2013
Accepted: 11 July 2013
Published: 17 July 2013
DOI: https://doi.org/10.1186/1471-2105-14-226

Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Abstract

Background

Results

Conclusions

Background

Methods

Preliminaries

Approximate matching of RNA sequence-structure patterns

Online approximate RNA database search for RSSPs: ScanAlign

Lemma 1

Proof

Faster online alignment with early-stop computation: LScanAlign

Lemma 2

Proof

Index-based search: LESAAlign

Enhanced index-based search: LGSlinkAlign

Example: searching for an RSSP with algorithm LGSlinkAlign

RNA secondary structure descriptors based on multiple ordered RSSPs

Results and discussion

Implementation and computational results

Comparison of running times

Scaling behavior of the online and index-based algorithms

Influence of stem and loop lengths on the search time

RNA family classification by global chaining of RSSP matches

Importance of structural constraints for RNA family classification

RaligNAtorsoftware package

Conclusions

Availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us