Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Background It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequence-structure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequence-structure alignment problem. Results We present new fast index-based and online algorithms for approximate matching of RNA sequence-structure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semi-global alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a user-defined sequence-structure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of non-matching substrings. Our new index-based methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequence-structure patterns, we use fast algorithms for the local or global chaining of approximate sequence-structure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new index-based algorithm achieves a speedup of factor 560. Conclusions The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented index-based algorithms, allows for the first time approximate matching of RNA sequence-structure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented open-source software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.uni-hamburg.de/ralignator.

http://www.biomedcentral.com/1471-2105/14/226 but share by far more specific structural and functional properties. Consequently, methods for effective RNA homology search (i.e. finding new members of an RNA family) cannot rely on sequence similarity alone, but also have to take structural similarity into account.
In this paper, we address the problem of searching nucleotide databases for occurrences of RNA family members. Since for this task it is not sufficient to rely on pure sequence alignment, we briefly review search methods that employ sequence and structure information.
There exist various general sequence-structure alignment tools which determine structural similarities that are too diverse to be alignable at the sequence level. Such tools can roughly be divided into two classes. The first class consists of tools that align RNAs with given structures or determine a common structure during the alignment process. Tools like MARNA [3] and RNAforester [4] require an a priori known secondary structure for both input RNAs. However, they suffer from the low quality of secondary structure prediction. Addressing this problem, other tools implement variations of the Sankoff algorithm [5], which provides a general but computationally demanding solution to the problem of simultaneously computing an alignment and the common secondary structure of the two aligned sequences. Unfortunately, even tools with improved running times using variations of this algorithm (LocARNA [6], Foldalign [7,8], Dynalign [9,10]) or heuristics [11] are simply not fast enough for rapid searches in large nucleotide databases. Hence, in a second class we identify more specialized tools for searching RNA families in nucleotide databases. These tools use a model or motif descriptors (i.e. patterns) defining consensus sequence and secondary structure properties of the families to be searched for. For example, Infernal [12] and RSEARCH [13] infer a covariance model from a given multiple sequence alignment annotated with structure information. This model can then be used to search sequence databases for new family members. Another tool, ERPIN [14] is also based on automatically generated statistical secondary profiles. Although being very sensitive in RNA homology search, in particular Infernal and RSEARCH suffer from high computational demands. An alternative are tools like RNAMotif [15], RNAMOT [16], RNABOB [17], RNAMST [18], PatScan [19], PatSearch [20], or Palingol [21]. These methods use user-defined motif descriptors created from a priori knowledge about the secondary structure of the described RNA family. Another tool, Locomotif [22], generates a thermodynamic matcher program from a pattern drawn interactively by the user via a graphical interface. Although these tools based on motif descriptors are faster than the previously mentioned tools, they have a running time that scales at least linearly with the size of the target sequence database. This makes their application to large databases challenging. Previously, we addressed this problem by presenting Structator [23], an ultra fast index-based bidirectional matching tool that achieves sublinear running time by exploiting base pair complementarity constraints for search space reduction.
Apart from running time constraints, another major disadvantage of all current tools that search for sequencestructure patterns is their limited capacity to find approximate matches to the patterns. Although variability in length of pattern elements is often allowed, this is constrained to certain pattern positions that must be specified by the user. This limitation also holds for our Structator tool. Also, variations (insertions, deletions, or replacements) in the sequence that lead to small structural changes, such as the breaking of a base pair, are not supported. This often hampers the creation of patterns that are specific but generalized enough to match all family members. An algorithm presented in [24] only partially alleviates this problem by finding approximate matches of a helix in a genome allowing edit operations on single bases, but not on the structure.
To overcome these issues, we present new fast indexbased and online algorithms for approximate matching of sequence-structure patterns, all implemented in an easy-to-use software package. Given one or more patterns describing any (branching, non-crossing) RNA secondary structure, our algorithms compute alignments of the complete patterns to substrings of the target sequence, i.e. semi-global alignments, taking sequence and structure into account. For this, they apply a full set of edit operations on single bases and base pairs. Matches are reported for alignments whose sequence-structure edit cost and number of insertions and deletions do not exceed user-defined thresholds. Our most basic algorithm is a scanning variant of the dynamic programming algorithm for global pairwise sequence-structure alignment of Jiang et al. [25], for which no implementation was available. Because its running time is too large for database searches on a large scale, we present accelerated online and index-based algorithms. All our new algorithms profit from a new computing scheme to optimally reuse the required dynamic programming matrices and a technique to save computation time by determining as early as possible whether a substring of the target sequence can contain a match. In addition, our index-based algorithms employ the suffix array data structure compiled from the search space. This further reduces the running time.
As in [23], we also support the description of an RNA molecule by multiple ordered sequence-structure patterns. In this way, the molecule's secondary structure is decomposed into a sequence of substructures described by independent sequence-structure patterns. These patterns are efficiently aligned to the target sequences using http://www.biomedcentral.com/1471-2105/14/226 one of our new algorithms and the results are combined with fast global and local chaining algorithms [23,26]. This allows a better balancing of running time, sensitivity, and specificity compared to searching with a single long pattern describing the complete sequence and secondary structure.
Before we describe our algorithms, we formalize the approximate search problem with the involved sequencestructure edit operations. Then we present, step by step, two efficient online and two index-based matching algorithms. We proceed with a short review of the approach for computing chains of matches. Finally, we present several benchmark experiments. For S = uv, u and v ∈ A * , u is a prefix of S, and v is a suffix of S. The k-th suffix of S starts at position k, while the k-th prefix of S ends at k. For 1 ≤ k ≤ n, S k denotes the k-th suffix of S. For stating the space requirements of our index structures, we assume that |S| < 2 32 , so that sequence positions and lengths can be stored in 4 bytes.

Preliminaries
The secondary structure of an RNA molecule is formed by Watson-Crick pairing of complementary bases and also by the slightly weaker wobble pairs. We say that two bases (c, d) ∈ A × A are complementary and can form a base pair if and only if (c, d) ∈ C = {(A, U), (U, A), (C, G), (G, C), (G, U), (U, G)}. If two bases a and b form a base pair we also say that there exists an arc between a and b. A non-crossing RNA structure R of length m is a set of base pairs (i, j), 1 ≤ i < j ≤ m, stating that the base at position i pairs with the base at position j, such that for all (i, j),

Approximate matching of RNA sequence-structure patterns
To find in a long RNA sequence S approximate matches of an RSSP Q describing a part of an RNA molecule, we compute alignments of the complete Q and substrings of S considering edit operations for unpaired bases and base pairs. That is, we compute semi-global alignments simultaneously obtaining the sequence-structure edit distance of Q and substrings of S.
We define the alignment of Q and a substring S[ p..q], .q] of match edges satisfies that, for all different (k, l), (k , l ) ∈ A match , k > k implies l > l . The set A gap of gap edges is defined as   Figure 1 for an example of a semi-global alignment and associated alignment edges. The alignment cost is based on a sequence-structure edit distance. The allowed edit operations on unpaired bases , and base deletion (insertion), with cost ω d , which occurs if (k, −) ∈ A gap ((−, l) ∈ A gap ). The possible edit operations on base pairs were first introduced by Jiang et al. [25] and are defined as follows. Let (k 1 , k 2 ) be a base pair in R and l 1 and l 2 , p ≤ l 1 < l 2 ≤ q, be positions in S.
• An arc breaking, with cost ω b , occurs if (k 1 , l 1 ) ∈ A match and (k 2 , l 2 ) ∈ A match but bases ). To give an example, consider the semi-global alignment in Figure 1. RSSP Q contains base pair (5,9) ∈ R and there exist edges (5, 11) ∈ A match and (9, 16) ∈ A match but S [ 11] = G and S [ 16] = G are not complementary. We note a difference between our definition and the definition of Jiang et al., where both aligned sequences are annotated with structure information. There, an arc breaking occurs if bases S[ l 1 ] and S[ l 2 ] are annotated as unpaired in addition to the condition of existing edges (k 1 , l 1 ) ∈ A match and (k 2 , l 2 ) ∈ A match . Hence, because in our case sequence S has no structure annotation, our definition is based on the complementarity of bases S[ l 1 ] and S[ l 2 ].
• An arc altering, with cost ω a , occurs if either (1) (k 1 , l 1 ) ∈ A match and (k 2 , −) ∈ A gap or (2) (k 2 , l 2 ) ∈ A match and (k 1 , −) ∈ A gap . Each case induces an additional base mismatch cost ω m if As an example, observe in the alignment shown in Figure 1 that there exist a base pair (11,16) ∈ R and edges (11, −) ∈ A gap and (16, 21) ∈ A match . • An arc removing, with cost ω r , occurs if (k 1 , −) ∈ A gap and (k 2 , −) ∈ A gap . As an example, observe in the alignment in Figure 1  ω d base deletion An alignment A of minimum cost between Q and S[ p..q] is an optimal alignment of Q and S[ p..q].
In practice, one is often interested in finding substrings of an RNA sequence S having a certain degree of similarity to a given RSSP Q on both the sequence and structure levels. Therefore, we are only concerned about optimal alignments of Q and substrings S[ p..q] with up to a userdefined sequence-structure edit distance and a limited number of allowed insertions and deletions (indels). More precisely: • the cost dist(Q, S[ p..q] ) should not exceed a given threshold K, and • the number of indels in the alignment should be at most d.
Thus, the approximate search problem for finding occurrences of an RSSP Q in S, given user-defined thresholds K and d, is to report all intervals [ p.
.q] such that We call every substring S[ p..q] satisfying Equation (3) a match of Q in S. In the subsequent sections we present algorithms for searching for matches of an RSSP Q in a sequence S.

Online approximate RNA database search for RSSPs: ScanAlign
A straightforward algorithm to search for approximate matches of an RSSP Q in an RNA sequence S consists of sliding a window of length m = m + d along S while computing dist(Q, S[ p..q] ) for 1 ≤ p ≤ q ≤ n and q − p + 1 = m . We note that, although the length of a match can vary in the range m−d to m+d, to find matches of all possible lengths it suffices to slide a window of length http://www.biomedcentral.com/1471-2105/14/226 m along S corresponding to substrings S[ p..q]. This holds because the alignment to a window of length m entails all possible alignments with up to d allowed indels. In the following we present a dynamic programming algorithm computing dist(Q, S[ p..q] ) for every window S[ p..q]. Our recurrences are derived from the algorithm for global pairwise sequence-structure alignment of Jiang et al. [25], i.e. an algorithm for aligning sequences of similar lengths. Although Jiang's algorithm supports the sequencestructure edit operations described above, we emphasize that it is not suitable for computing semi-global alignments, which is what we are interested in.
We begin the description of our algorithm by defining three functions required by the dynamic programming recurrences. Let T = S[ p..q].
1. For computing base match and mismatch costs for positions i and j of the RSSP Q = (P, R) and substring T, respectively, we define a function χ : N × N → {0, 1} as: 2.
To determine whether an arc breaking operation can occur, we must also be able to check for base complementarity at positions i and j of T. Therefore, we define a function comp : N × N → {0, 1} as: 3. For determining the correct row (of the dynamic programming matrices introduced below) where certain operation costs must be stored we introduce a function row : N → N defined as: Intuitively, function row satisfies the following: (1) given the right index i of a base pair (i , i), it returns the left index i if (i , i) is preceded or followed by other structures; (2) given the left index i of a base pair (i, i ), it returns 0 if the base at position i + 1 of Q is unpaired; and (3) given any other position index i, it returns i itself.
Using these three functions, our algorithm determines the sequence-structure edit distance dist(Q, T[ 1..m ] ) by computing a series of m + 1 (m + 1) × (m − k + 1) matrices DP k , for 1 ≤ k ≤ m + 1, such that DP 1 (row(m), m ) = dist(Q, T[ 1..m ] ). We remark that DP k (i, j) is not defined for every subinterval [ i..j]. While the recurrences of Jiang's algorithm are divided in four main cases, we present a simplified recurrence relation with only two main cases. In addition, we observe that we use only three indices for a matrix entry instead of four. Our recurrences are as follows.
A natural way to compute these DP matrices is top down, checking whether case 1, 2(a), or 2(b) applies, in this order. Due to the matrix dependencies in cases 2(a) and (b), the matrices need to be computed simultaneously.
Note that for all j, 1 ≤ j ≤ m , clearly DP 1 (row(m), j) = dist(Q, T[ 1..j] ). Therefore all candidate matches shorter than m beginning at position p are also computed in the computation of dist(Q, T [ 1..m ] ). The following Lemma is another important contribution of this work and also the key for the development of an efficient algorithm. 0 ≤ l < m , only differs by its last character from S[ p + l + 1..q + 1] which are suffixes of the window substring shifted by one position to the right, the lemma holds.

Lemma 1. When sliding a window along S to compute
Due to Lemma 1, our algorithm computes only the last column of the DP matrices for every shifted window substring (see the example in Figure 2) and just for the first window S[ 1..m ] it computes every column. We call this algorithm ScanAlign. We note that during the reviewing process of this manuscript, Will et al. [27] submitted and published an algorithm for semi-global sequence-structure alignment of RNAs. As our method, this algorithm saves computation time by reusing entries of dynamic programming tables while scanning the target sequence.
Our ScanAlign algorithm has the following time complexity: computing DP k (i, j) in Since there are n − m − 1 window shifts, the computation for all shifted windows takes O(mm 2 (n − m )) = O(mm 2 n) time. We observe that the time needed by ScanAlign to compute all window shifts reduces to O(mm n) if recurrence case 2(b) is not required. This is the case if the structure of Q does not contain unpaired bases before a base pair constituting e.g. a left dangling end or left bulge.

Faster online alignment with early-stop computation: LScanAlign
Often, before completing the computation of the alignment between an RSSP Q and a window substring S[ p..q] of the searched RNA sequence, we can determine whether the cost of this alignment will exceed the cost threshold K. By identifying this situation as early as possible, we can improve algorithm ScanAlign to skip the window, thus saving computation time and proceed with aligning the next window. The idea consists in checking, during the alignment computation, whether the cost of an already aligned region of Q and a substring of S[ p..q] exceeds K. In such a case, the alignment cost of the complete Q and S[ p.
.q] will also exceed K. In more detail, this works as follows.
• We decompose the RSSP Q into regions that can themselves represent a pattern, e.g. a stem-loop or unpaired region. A basic constraint is to not split base pairs to different regions. • We compute the alignment of a given initial RSSP region and a substring of the current window S[ p..q], progressively extending the alignment to other regions. • If the cost of aligning an RSSP region to a substring of the window exceeds cost threshold K, then the entire pattern cannot match the window. This means that the window can immediately be skipped.
Formally, a valid RSSP region Q[ x..y], 1 ≤ x ≤ y ≤ m, satisfies exactly one of the following conditions.

y forms a base pair
Examples of such RSSP regions are shown in red in Note that regions can be embedded in other regions but cannot partially overlap another.
Our progressive alignment computation of an RSSP Q and a window substring of the searched RNA sequence S begins by considering only an in general small region of Q embedded in another region. The computation is then extended to a surrounding region, e.g. from region Q[ 6. .y ] for some y < x and x < y. Therefore, they can easily be sorted to indicate the order by which the rows of the DP matrices are computed. We observe that the top-down computation of the DP matrices, as described above, automatically sorts the regions and respects the dependency between rows. To obtain from the sorted regions the indices of the rows to be computed, we consider the condition satisfied by each region. The rows obtained according to each condition are computed according to one case of the recurrence. Given region Q[ x..y] identified by one of the four conditions this region satisfies, the following rows of the matrices have to be computed.

All rows in the interval [ x..y] are computed by
Equation (7).   The sequential computation of the rows belonging to each region naturally leads to the computation of the entire alignment of Q and sequence-structure edit distance dist(Q, T). Our improvement of the ScanAlign algorithm is based on the following two observations.
• The standard dynamic programming algorithm for aligning two plain text sequences of lengths m and n requires an (m + 1) × (n + 1) matrix. Let i and j be indices of each of the matrix dimensions and a diagonal v be those entries defined by i and j such that Given that the cost of each edit operation is a positive value, the cost of the entries along a diagonal of the matrix are always non-decreasing [28]. • Moreover, one indel operation implies that an optimal alignment path including an entry on diagonal v also includes at least one entry on diagonal v + 1 or v − 1. Now let v be the diagonal ending at the entry on the lower-right corner of the matrix and d be the number of allowed indels. One can stop the alignment computation as soon as all the entries of one row in the matrix and along diagonals v For our improvement of algorithm ScanAlign, based on the following Lemma, we define a diagonal for each RSSP region instead of only one for the entire matrices.
Proof. If the RSSP region Q[ x..y] originates from condition 1 or 2 (3 or 4) above, we define the entries on a diagonal e as those entries Without loss of generality let d = 1. Assuming x − 1 > 0 and y + 1 ≤ m , this means that an optimal alignment of pattern Q and substring T requires Q[ x..y] to align with: .y+1] because these substrings already imply alignments with one indel. As the sum of the cost of the edit operations on the minimizing path increases monotonically and there cannot be other minimizing paths due to the limited number of indels d, the lemma holds.
Let Q be an RSSP whose regions are sorted by the order of computation of their respective rows in the DP tables above, let d be the number of allowed indels, and T = T[ 1..m ] be a window substring of the searched RNA sequence. Applying Lemma 2, we modify algorithm ScanAlign to compute the alignment of each region

and progressively extend the alignment to other RSSP regions and substrings of T as long as
holds. That is, for each RSSP region, it determines the rows and recurrence case required for their computation according to conditions 1, 2, 3, or 4 above. Then, within each processed row i, it checks whether for at least one entry DP k (i, j) on a possible minimizing path, i.e. on diagonals e , e − d ≤ e ≤ e + d, DP k (i, j) ≤ k. If no entry is below K, it skips the alignment computation for all remaining RSSP regions and proceeds with aligning the next window. See Figure 2 for an example of the DP matrices of an alignment computation whose entries on a possible minimizing path are highlighted in yellow.
When scanning the searched RNA sequence, a window can be shifted before all DP matrices entries are computed. Hence, a direct application of Lemma 1 is no longer possible. To overcome this, we define an array Z in the range 1 to z, where z is the number of RSSP regions, and associate each region with an index r, 1 ≤ r ≤ z. Let p be the starting position of the window substring S[ p..q] in the RNA sequence. We set Z[ r] = p whenever all DP matrices rows and columns belonging to region r are computed. This occurs when the cost of aligning this region does not exceed cost threshold K. Now, when aligning the same RSSP region r to a different window substring S[ p ..q ], p > p, computing all DP matrices columns requires to compute the last p − p columns. If p − p < m (recall that m = q − p = q − p ), this means that the two window substrings do not overlap and therefore no DP matrix column can be reused. http://www.biomedcentral.com/1471-2105/14/226 Our improved algorithm, hereinafter called LScanAlign, in the worst case needs to process every RSSP region for every window shift. Hence, it has the same time complexity as algorithm ScanAlign. However, as in many cases only a few RSSP regions are evaluated, it is much faster in practice as will be shown later. ScanAlign and LScanAlign are the basis for further improvements presented in the subsequent sections.

Index-based search: LESAAlign
Suffix trees and enhanced suffix arrays are powerful data structures for exact string matching and for solving other string processing problems [29,30]. In the following we show how the use of enhanced suffix arrays leads to even faster algorithms for searching for matches of an RSSP Q in an RNA sequence S.
The enhanced suffix array of a sequence S is composed of the suffix array suf and the longest common prefix array lcp. Let $, called terminator symbol, be a symbol not in A for marking the end of a sequence. $ is larger than all the elements in A. suf is an array of integers in the range 1 to n + 1 specifying the lexicographic order of the n + 1 suffixes of the string S$. That is, S suf [1] , S suf [2] , ..., S suf[n+1] is the sequence of suffixes of S in ascending lexicographic order. Table suf requires 4n bytes and can be constructed in O(n) time and space [31]. In practice non-linear time construction algorithms [32,33] are often used as they are faster. lcp is a table in the range 1 to n + 1 such that lcp[ 1] = 0, and lcp[ i] is the length of the longest common prefix between S suf[i −1] and S suf[i] for 1 < i ≤ n + 1. Table lcp requires n bytes and stores entries with value up to 255, whereas occasional larger entries are stored in an exception table using 8 bytes per entry [30]. More space efficient representations of the lcp table are possible (see [34]). The construction of table lcp can be accomplished in O(n) time and space given suf [35]. For an example of an enhanced suffix array, see Figure 4. In the following we assume that the enhanced suffix array of S has already been computed.
Consider an RSSP Q to be matched against an RNA sequence S with up to d indels.
When searching for matches of Q in S, we observe that algorithms ScanAlign and LScanAlign scan S computing dist(Q, S[ p..q] ) for every window substring of length q−p+1 = m+d. In the suffix array, each substring To match Q in S using a suffix array, we simulate a depth first traversal of the lcp interval tree [30] of S on the enhanced suffix array of S such that the reading depth of each suffix is limited by p i . That is, we traverse the suffix array of S top down, computing the sequence- . We recall that candidate matches of Q have length between m−d and m+d and that p i ≤ m+d. In case p i < m−d, we can skip S suf [i] . Also, remember that all candidate matches shorter than p i are obtained as a by-product of   Figure 5. We observe that if p i ≤ lcp, no DP entry needs to be recomputed. In this case, two situations arise: These situations allow LESAAlign to benefit from the enhanced suffix array in a second important way. That is, it skips all suffixes sharing a common prefix of at least length lcp with S suf[i −1] . To find the index j of the last suffix S suf[j] to be skipped, it suffices to look for the largest j such that min{lcp . We note that suffixes can also be efficiently skipped using so-called skip-tables as described in [36]. However, to save the 4n additional bytes required to store such tables we do not use them here. Our algorithm continues the top-down traversal of the suffix array with suffix S suf[j+1] , taking into account that the DP tables were last computed for S suf[i −1] . Consequently, the length of the longest common prefix between S suf[i−1] and S suf[j+1] to be considered in the processing of We also incorporate in our index-based algorithm the early-stop alignment computation scheme of algorithm LScanAlign. This allows to skip suffixes S suf[i] as soon as it becomes clear that the sequence-structure edit distance of RSSP Q and S suf[i] up to reading depth p i will exceed the cost threshold K. For this, LESAAlign progressively aligns regions of Q to a substring of the current suffix as in algorithm LScanAlign, checking whether the cost of each subalignment remains below the cost threshold K, thus applying Lemma 2. If the cost exceeds K, the alignment computation of the remaining pattern regions is skipped and the algorithm proceeds with processing the next suffix. To avoid recomputing as many entries of the DP matrices as possible while traversing the suffix array, LESAAlign differs from LScanAlign in the way it manages (non-) aligned regions for each suffix. Lemma 1, which algorithm LScanAlign applies to support early-stop computation, relies on scanning the searched RNA sequence S and overlapping window substrings. This makes it unsuitable for use with the suffix array. Instead, LESAAlign only uses information from the lcp table as follows. Let z be the number of regions of Q indexed from 1 to z and T = S suf[i] [ 1..p i ] be the current substring. When progressively aligning the regions of Q to a substring of T, we store the index r of the first region whose alignment cost exceeds K, if there is any. That is, for the first region Q[ x..y] whose index r we store, it holds that for every d , Lemma 2). Then, when aligning Q to a subsequent substring S suf[j] [ 1..p j ], we must distinguish the regions of Q previously computed from regions not computed.
• Previously computed pattern regions are all regions whose index is strictly smaller than r. The alignment computation of these regions profits from the common prefix between S  We observe that longer ranges of suffixes not containing matches to Q can be skipped thanks to the early-stop alignment computation scheme. Note that the left-most character of T needed to assert dist(Q[ x..y] , T x+d [ 1..
can match Q and thus can be skipped in the top-down traversal of the suffix array of S. Because in most cases y + d < p i , more suffixes are likely to share a prefix of length y + d than of length p i with S suf [i] . For the pseudocode of algorithm LESAAlign, see Section 1 of Additional file 1.

Enhanced index-based search: LGSlinkAlign
Given an RSSP Q to be searched in an RNA sequence S, algorithm LESAAlign is very fast when it can • avoid recomputation of DP matrices columns due to a common prefix between suffixes of S ; and • skip long ranges of suffixes of the suffix array suf whose common prefix up to a required reading depth are known to match or not match Q. . Despite this common prefix, still 182/252 ≈ 72% of the DP matrices entries need to be computed (disregarding initialization rows and columns 0) in case no early-stop is possible, i.e. in case K > 4. This is more than the at most 56/252 ≈ 22% of the DP matrices entries computed by the online algorithm LScanAlign for a window shift. Our next goal is to develop an algorithm traversing the enhanced suffix array of S that: 1. can skip more suffixes; and 2. improves the use of already computed DP matrices entries, reusing computed entries for as many suffixes as possible.
To address the first goal, we motivate our method by recalling the alignment computation example in .y] as a prefix. We begin by locating one such suffix, in particular the suffix of index suf[ j] that contains all but the first x = x − 1 characters of S suf[i] , i.e. suffix S suf[j] = S suf[i]+x . We determine j using a generalization of a concept originated from suffix trees. It is a property of suffix trees that for any internal node spelling out string T there is also an internal node spelling out T 2 whenever |T| > 1 [37]. A pointer from the former to the latter node is called a suffix link. In the case of suffix arrays, a suffix link can be computed using the inverse suffix array suf −1 of S$. suf −1 is a holds. However, we also want to be able to determine j for any x ≥ 1. The obvious solution is to compute suffix links x successive times. Each suffix link skips the first character of the previously located suffix. For a more efficient solution, we generalize suffix links to point directly to the suffix without a prefix of any length x of the initial suffix. For this purpose we define a function link : N × N → N as: To skip these suffixes not containing matches to Q in the top-down traversal of the suffix array suf, we mark their positions as true (for already"processed") in a bit array vtab of n bits. The suffix array traversal proceeds from position suf[ i], but skips the marked suffixes when their positions are reached. We remark that the described method for skipping suffixes can profit from a resorting according to the order by which RSSP regions are aligned. In the alignment computation example in Figure 2 We now address the second goal, namely reusing computed DP matrices entries for as many suffixes as possible. Recall that computing the sequence-structure edit distance dist(Q, S suf[i] [ 1..p i ] ) for each suffix S suf[i] up to reading depth p i means computing p i +1 DP matrices, one for each suffix T k of string T = S suf[i] [ 1..p i ], 1 ≤ k ≤ m , and one for the empty sequence ε. Observe that each suffix T k , T k = T, also occurs itself as a prefix of a suffix in table suf, i.e. there exists a suffix S suf[j] shorter than S suf[i] by exactly k − 1 characters which has prefix T k . Consequently, T k is processed again in an alignment to RSSP Q at a different point in time during the traversal of suf. Let T = S suf[j] [ 1..p j ]. Now note that if T is at a (nearly) contiguous position in suf to T, T and T are likely to share a common prefix due to their similar lexicographic ranking. This allows algorithm LESAAlign to avoid recomputation of DP matrices columns by using information from the lcp table. Unfortunately, T and T can be lexicographically ranked far away from each other in table suf, meaning that the DP matrices computed for T either: • were already computed once because T is lexicographically smaller than T, but were discarded to allow the processing of other suffixes until T was traversed; or • are computed for the first time otherwise, but will not be reused to also allow the processing of other suffixes until T occurs in table suf as a prefix of a suffix itself.
In both cases, redundant computations occur. To avoid this, we optimize the use of computed DP matrices by processing T directly after processing T for fixed k = 2, recalling that T = S suf  (i, 1). Also, k = 2 implies that T only differs by its last character from T, aside from not beginning with character T [ 1]. Therefore, to determine dist(Q, T ), we only have to compute the last column of the DP matrices required to compute dist(Q, T) as shown by Lemma 1. We note that, because i and j are not necessarily contiguous positions in suf, we mark the processed suffix S suf[j] in the bit array vtab so that it is only processed once. If no match to RSSP Q begins at position suf[ j], we also mark and skip every suffix sharing the substring with T whose alignment to a region of Q is known to exceed threshold K. Once T is processed and all possible suffixes are skipped, we recursively repeat this optimization scheme by setting T = T and processing the next T = S suf[j ] [ 1..p j ] where j = link(j, 1). The recursion stops when p j < m − d, meaning that T is too short to match Q, or when suf[ j ] is already marked as processed in vtab. The suffix array traversal proceeds at position i + 1 repeating the entire scheme. We call our algorithm incorporating the presented improvements LGSlinkAlign. For its pseudocode, see Section 1 of Additional file 1.
LGSlinkAlign inherits all the improvements of the above presented algorithms. In summary, its improvements are as follows.
• LGSlinkAlign traverses the enhanced suffix array of the searched sequence S, i.e. the suffix array suf enhanced with tables lcp and suf −1 . During this traversal, it benefits from common prefixes shared among suffixes to (1) avoid the computation of DP matrix columns and to (2) skip ranges of suffixes known to match or not match RSSP Q as in algorithm LESAAlign. • The suffix array traversal is predominantly top down, but non-contiguous suffixes are processed to optimize the use of computed DP matrices. • LGSlinkAlign stops the alignment computation as early as the alignment cost of a region of RSSP Q and a substring of the prefix of the current suffix exceeds threshold K, an improvement first introduced in algorithm LScanAlign. • Due to the early-stop computation scheme, suffixes sharing common prefixes shorter than m + d can be http://www.biomedcentral.com/1471-2105/14/226 skipped, leading to larger ranges of skipped suffixes. The early-stop computation scheme also helps to identify and skip non-contiguous suffixes sharing a common substring which is not their prefix.

Example: searching for an RSSP with algorithm LGSlinkAlign
We elucidate the ideas of algorithm LGSlinkAlign with the following example. Consider the RSSP Q = (AAGUUUC, ..(...)) to be matched in the sequence S whose enhanced suffix array is shown in Figure 4. To keep the example simple, we only allow a small cost threshold and number of indels, i.e. we set K = d = 1. The costs of the edit operations are ω d = ω m = ω b = ω a = 1 and ω r = 2. LGSlinkAlign to reuse as many computed DP matrices entries as possible continues processing other suffixes which are located by iteratively following the suffix links. It locates suffixes S suf [8] , S suf [4] , S suf [18] , and S suf [19] because link(12, 1) = 8, link(8, 1) = 4, link(4, 1) = 18, and link(18, 1) = 19, respectively. These suffixes are processed analogously as above, one after the other, not resulting in matches to Q. The iteration then leads to suffix S suf [20] , since link(19, 1) = 20. However, |S suf [20] | < m − d, meaning that this suffix is too short to contain a match to Q. This causes the iteration to stop. The suffix array traversal proceeds and repeats the entire matching scheme from the suffix that follows the last processed suffix not located via a suffix link, i.e. suffix S suf [2] . After processing and skipping all possible suffixes, we note that LGSlinkAlign does not report any matches for the defined cost threshold and allowed number of indels K = d = 1. By setting K = 5, it reports a match at position 16.

RNA secondary structure descriptors based on multiple ordered RSSPs
RNAs with complex branching structures often cannot be adequately described by a single RSSP due to difficulties in balancing sensitivity, specificity, and reasonable running time of the used search algorithm. Although their description by a single short RSSP specifying an unbranched fragment of the molecule might be very sensitive, it is often too unspecific and likely to generate many spurious matches when searching for structural homologs in large sequence databases or complete genomes. In contrast, using a single long RSSP often requires a higher cost threshold K for being sensitive enough which in turn, together with the increased RSSP length, has a negative influence on the search time. This might lead to disadvantageous running times in larger search scenarios in practice. We solve this problem by applying the powerful concept of RNA secondary structure descriptors (SSDs for http://www.biomedcentral.com/1471-2105/14/226 short) recently introduced in [23]. The underlying concept of SSDs is similar to the idea of PSSM family models [38], which are successfully used for fast and sensitive protein homology search. SSDs use the information of multiple ordered RSSPs derived from the decomposition of an RNA's secondary structure into stem-loop like structural elements. In a first step, approximate matches to the single RSSPs the SSD consists of are obtained using one of the algorithms presented above. From these matches, either local or global high-scoring chains are computed with the efficient chaining algorithms described in [23]. These algorithms take the chain's score, i.e. the weights of the fragments in the chain, into account (see [23] for details). For chaining of approximate RSSP matches, we use the fragment weight ω * Q − dist(Q, T) for an RSSP Q of length m matching substring T, where ω * Q = m * ω m +bps * ω r and bps denotes the number of base pairs in Q.
Here ω * Q is the maximal possible weighting Q can gain when being aligned and therefore it reflects the situation of a perfect match between Q and T. With this definition of a fragment's weight, a positive weight is always guaranteed, thus satisfying a requirement for the chaining algorithm. Once the chaining of matches to the RSSPs is completed, the high-scoring chains are reported in descending order of their chain score. By restricting to high-scoring chains, spurious RSSP matches are effectively eliminated. Moreover, the relatively short RSSPs used in an SSD can be matched efficiently with the presented algorithms leading to short running times that even allow for the large scale application of approximate RSSP search.

Implementation and computational results
We implemented (1) the fast index-based algorithms LESAAlign and LGSlinkAlign, (2) the online algorithms LScanAlign, ScanAlign, both operating on the plain sequence, and (3) the efficient global and local chaining algorithms described in [23]. In our experiments we use ScanAlign, which is the scanning version of the method proposed in [25], for reference benchmarking. All algorithms are included in the program RaligNAtor. The algorithms for index construction were implemented in the program sufconstruct, which makes use of routines from the libdivsufsort2 library (see http:// code.google.com/p/libdivsufsort/) for computing the suf table in O(n log n) time. For the construction of table lcp we employ our own implementation of the linear time algorithm of [35]. All programs were written in C and compiled with the GNU C compiler (version 4.5.0, optimization option -O3). All measurements are performed on a Quad Core Xeon E5620 CPU running at 2.4 GHz, with 64 GB main memory (using only one CPU core). To minimize the influence of disk subsystem performance, the reported running times are user times averaged over 10 runs. Allowed base pairs are canonical Watson-Crick and wobble, unless stated otherwise. The used sequence-structure operation costs are ω d = ω m = ω b = ω a = 1 and ω r = 2.

Comparison of running times
In a first benchmark experiment we measure the running times needed by the four algorithms to search with a single RSSP under different cost thresholds K and number of allowed indels d. We set (1)  LGSlinkAlign with sufconstruct and their storage on disk required 372 seconds. In the following we refer to this dataset as RFAM10.1 for short. In this experiment we use the RSSP tRNA-pat of length m = 74 shown in Figure 6 describing the consensus secondary structure of the tRNA family (Acc.: RF00005). The results of this experiment are presented in Figure 7 and Table S4, S5, and S6 of Additional file 1.
LGSlinkAlign and LESAAlign are the fastest algorithms.
LGSlinkAlign is faster in particular for increasing values of K and d, being only slower than LESAAlign for small values of K and d and for fixed d = 0. The advantage of LGSlinkAlign over LESAAlign with higher values of K and d is explained by the increased reading depth in the suffix array implicated by K and d and the fewer suffixes sharing a common prefix that can be skipped. This holds for both LGSlinkAlign and LESAAlign, however LGSlinkAlign counterbalances this effect by reusing computed DP matrices for non-contiguous suffixes of the suffix array. In a comparison to the two online algorithms considering only approximate matching, i.e. K ≥ 1, the speedup factor of LGSlinkAlign over ScanAlign (LScanAlign) is in the range from 560 for K = 1 and d = 0 to 17 for K = d = 6 (from 15 for K = 2 and d = 0 to 3 for K = d = 6). LESAAlign achieves a speedup factor over ScanAlign (LScanAlign) in the range from 1, 323 for K = 1 and d = 0 to 9 for K = d = 6 (29 for K = 1 and d = 0 to 1.6 for K = d = 6). In a comparison between the online algorithms, LScanAlign is faster than ScanAlign by up to factor 45 for K ≥ 1. In summary, all algorithms except ScanAlign profit from low values of K and d reducing their search times. This is a consequence of the use of the early-stop alignment computation scheme. As shown in Figure 7(2), also the number of allowed indels d influences the search time. For an additional experiment investigating the influence of K and d on the search time required by the four algorithms, see Section 2 of Additional file 1. A further experiment, described in Section 3 of Additional file 1, compares RaligNAtor and http://www.biomedcentral.com/1471-2105/14/226 .((((.........)))). (((((.......)))))..... (((((.......)))))))))))). the widely used tool RNAMotif [15] in terms of sensitivity and specificity in searches for the tRNA-pat depicted in Figure 6.

Scaling behavior of the online and index-based algorithms
In a second experiment we investigate how the search time of algorithms ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign scales on random subsets of RFAM10.1 of increasing size. The searched RSSPs flg1, flg2, and flg3 were derived from the three stem-loop substructures the members of family flg-Rhizobiales RNA motif (Acc.: RF01736) [40] fold into. These patterns differ in length, cost threshold K and number of allowed indels d; see Figure 8 for their definition, noting that K and d are simply denoted cost and indels in the RaligNAtor RSSP syntax. The results are shown in Figure 9 and LGSlinkAlign over ScanAlign on the smallest and the full subsets. Comparing the search time for pattern flg3 individually, the speedup of LGSlinkAlign over ScanAlign ranges from 22.6 to 38.8. We also observe that ScanAlign requires the longest time to match the longest pattern flg2 of length m = 37. The other algorithms profit from the early-stop computation approach to reduce the search time for this pattern on every database subset.

Influence of stem and loop lengths on the search time
When searching a database for matches of a given pattern, our algorithms compute the required DP matrices using recurrences according to two main cases: either a row corresponds to an unpaired or to a paired base of the pattern. To analyze the influence of the used recurrence on the search time of each algorithm, we search RFAM10.1 for artificial stem-loop patterns. Therefore we vary the number of bases in the loop of pattern Q = (NNNACANNN, (((...)))) from 3 to 12 by using As and Cs. Additionally, we vary the number of base pairs in the stem of pattern Q = (NNACANN, ((...))) from 2 to 11 by pairs of Ns. Matching the patterns in these two experiments means to increase the use of the DP recurrences in Equations (7) and (8) Figure 10. We observe that increasing the number of bases in the loop has little influence and even reduces the running time of the two fastest algorithms LGSlinkAlign and LESAAlign. This can be explained by the use of the early-stop alignment computation scheme in these algorithms. The reduction of the running time is explained by the fewer matches that need to be processed as the pattern gets longer and more specific. For an increasing number of base pairs in the stem, LGSlinkAlign is the least affected algorithm. We also observe that the linear increase in running time of the basic online algorithm ScanAlign, caused by an extension of the pattern by one base pair, is similar to the effect of adding two bases in the loop.
The same search is performed using the SSD of family RF01736. It results in 4, 145 matches for flg1, 68, 024 for flg2, and 67 for flg3. Chaining the matches leads to 15 chains occurring each in one of the 15 sequence members of the family in the full alignment. The highest (lowest) chain score is 163 (156). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all three RSSPs requires 755.48, 336.69, 133.58, and 52.86 minutes, respectively, whereas chaining requires 0.03 seconds. The time required to match each pattern using each algorithm is reported in Table S9 of Additional file 1.
We also show that the lack of the sequence-structure edit operations supported by RaligNAtor deteriorates sensitivity and specificity in the search for sequence members of families RF00458 and RF01736. For this, we report in Section 4 and Table S10 of Additional file 1 results obtained with the Structator tool [23]. Structator is much faster but, in contrast to RaligNAtor, does not support all sequence-structure edit operations.

Importance of structural constraints for RNA family classification
To assess the potential of using RSSPs for reliable RNA homology search on a broader scale and to investigate the effect of using base pairing information, we evaluated RaligNAtor on 35 RNA families taken from Rfam 10.1 with different degrees of sequence identity and of different sizes. See Table 1 for more information about the selected families. In our experiment, we compared (1) RaligNAtor results obtained by using RSSPs derived from Rfam seed alignments with (2) results obtained for the same RSSPs ignoring base pairing information and (3) results obtained by blastn [41] searches with the families' consensus sequence. For each selected family, we automatically compiled an RSSP Q = (P, R) from the family's seed alignment using the following procedure: at each position of the RSSP's sequence pattern P, we choose the IUPAC wildcard matching all symbols in the corresponding alignment column. As structure string R, we use the secondary structure consensus available in the Rfam seed alignment. a query sequence for blastn, we compute the consensus sequence from the family's seed alignment. Because blastn does not appropriately handle IUPAC wildcard characters in the query, we choose the most frequent symbol occurring in a column as representative symbol in the consensus sequence. For the RaligNAtor searches, we adjust the cost threshold K and number of allowed indels d such that we match the complete family. That is, we achieve a sensitivity of 100%. The used operation costs are ω d = ω m = 1, ω b = ω a = 2, and ω r = 3. For the Blast searches, we called blastn with parameters -m8 -b 250000 -v 250000 and a very relaxed E-value cutoff of 1000. From the two RaligNAtor and one blastn outputs we count the number of true positives (#TPs) and false positives (#FPs) and compute ROC curves on the basis of the RaligNAtor score ω * Q − dist(Q, T) and the blastn bit score. See Table 1 and Figure 11 for the results of this experiment. A ROC curve with values averaged over all families is shown in Figure 11(1).
In addition, we show in Figures 11(2) and (3) the results of the ROC analysis for the families with the lowest and highest degree of sequence identity. For the ROC curve of each selected family, see Figures S7 and S8 of Additional file 1. Clearly, by using base pairing information, RaligNAtor achieves a higher sensitivity with a reduced false positive rate compared to searches ignoring base pairing (compare columns "RaligNAtor" and "RaligNAtor (sequence only)" in Table 1). This is in particular evident when searching for families with a low degree of sequence identity. This can be explained by the small amount of information left in the RSSP for such a family, once the structural information is removed. Due to the high variability of bases in the columns of the multiple alignment of the family, the pattern contains a large http://www.biomedcentral.com/1471-2105/14/226 number of wildcards. These symbols alone, without the constraints imposed by the base pairs, lead to unspecific patterns and therefore to a large number of false positives. We observe that, for families with sequence identity of up to 59%, the area under the curve (AUC) is considerably larger when base pairing information is taken into account. This difference decreases with increasing sequence identity (compare Figures 11(2) and (3)). Overall, the average AUC value over all families is, with a value of 0.93, still notably higher when base pairing information is considered compared to 0.89 if base pairing information is ignored (see Table 1). In this experiment, blastn only finds all members of those families whose sequence identity is at least 85%. This is due to the fact that blastn cannot appropriately handle IUPAC wildcard characters. Hence, by taking the most frequent symbol in an alignment column as consensus symbol, the heterogeneity of less conserved positions in the alignment cannot be adequately modeled. For the blastn searches, the average AUC value over all families is only 0.72.

RaligNAtor software package
RaligNAtor is an open-source software package for fast approximate matching of RNA sequence-structure patterns (RSSPs). It allows the user to search target RNA or DNA sequences choosing one of the new online or further accelerated index-based algorithms presented in this work. The index of the sequence to be searched can be easily constructed with program sufconstruct distributed with RaligNAtor. Searched RSSPs can describe any (branching, noncrossing) RNA secondary structure; see examples in Figures 1, 6, 8, and S5 of Additional file 1. Bases composing the sequence information of RSSPs can be ambiguous IUPAC characters. As part of the search parameters for RSSPs, the user can specify the cost of each sequence-structure edit operation defined above, the cost threshold of possible matches, and the number of allowed indels. The RSSPs, along with costs and thresholds per RSSP, are specified in a simple text file using a syntax that is expressive but easy to understand as shown in the mentioned figures. Another possibility is to provide the same costs and thresholds for all searched patterns as parameters in the command line call to RaligNAtor. To ensure maximal flexibility, the user can also define the base pairing rules from an arbitrary subset of A × A as valid pairings in a separate text file. Searches can be performed on the forward and reverse strands of the target sequence. Searching on the reverse strand is implemented by reversal of the RSSP and transformation according to Watson-Crick base pairing. Wobble pairs {(G,U), (U,G)} automatically become {(C,A), (A,C)}. Due to these transformations, the index is built for one strand only.
For describing a complex RNA with our concept of secondary structure descriptor (SSD), i.e. with multiple RSSPs, the user specifies all RSSPs in one text file. The order of the RSSPs in the file will then specify the order of the RSSP matches used to build high-scoring chains. The chain score directly depends on the score of each match occurring in the chain. This is inversely proportional to the sequence-structure edit distance of the RSSP and its matching substring in the target sequence. Hence, higher scores indicate sequences with a higher conservation which are probably more closely related to the sought RNA family.
Chaining of matches discards spurious matches not occurring in any chain. An additional filtering option http://www.biomedcentral.com/1471-2105/14/226 Searches are performed using RaligNAtor with and without base pairing information (column "RaligNAtor (sequence only)") and using program blastn with the families' seed alignment consensus sequence as query. Column "size" indicates the number of members in a family. Column "seq. ident." gives the families' sequence identity as listed in the Rfam database. #TP and #FP stand for number of found true and false positives, respectively. AUC is the area under the curve of the corresponding ROC curves shown in Figures 11, S7,  eliminates matches overlapping another with a higher score for the same RSSP. This is particularly useful when indels lead to almost identical matches that are only shifted by a few positions in the target sequence. The output of RaligNAtor includes not only matching positions to single RSSPs and chains, but their sequencestructure alignment to the matched substrings as well. In the RaligNAtor software package, all programs for searching patterns support multithreading to take advantage of computer systems with multiple CPU cores. There are two modes of parallelism. At first, different patterns are searched using multiple threads. Additionally, the search space (i.e. the sequence for the online algorithms and the index structure for the index-based methods) is partitioned, processing each part using a different thread. Lastly, we remark that our software also provides an implementation of the original algorithm of Jiang et al. for global sequence-structure alignment [25], easily applicable by the user.

Conclusions
We have presented new index-based and online algorithms for fast approximate matching of RNA sequencestructure patterns. Our algorithms, all implemented in the RaligNAtor software, stand out from previous search tools based on motif descriptors by supporting a full set of edit operations on single bases and base pairs. See Table 2 for an overview of the algorithms. In each algorithm, the application of a new computing scheme to optimally reuse the entries of the required dynamic programming matrices and an early-stop technique to avoid the alignment computation of non-matching substrings http://www.biomedcentral.com/1471-2105/14/226 LGSlinkAlign 9.125n The two online algorithms ScanAlign and LScanAlign need no additional memory except for the searched sequence of length n. Column additional memory requirements refers to the additional memory needed by the used index tables. Recall that tables suf and suf −1 require 4n bytes each. Table lcp can be stored in 1n bytes and the bit array vtab requires only n bits (= 0.125n bytes).
led to considerable speedups compared to the basic scanning algorithm ScanAlign. Our experiments show superior performance of the index-based algorithms LGSlinkAlign and LESAAlign, which employ the suffix array data structure and achieve running time sublinear in the length of the target database. When searching for approximate matches of biologically relevant patterns on the Rfam database, LGSlinkAlign (LESAAlign) was faster than ScanAlign and LScanAlign by a factor of up to 560 (1,323) and 17 (29), respectively (see Figure 7). Comparing the two index-based algorithms, LESAAlign was faster than LGSlinkAlign when searching with tight cost threshold (i.e. sequence-structure edit distance) and no allowed indels, but became considerably slower when the number of allowed indels was increased. In this scenario, LGSlinkAlign was faster than LESAAlign by up to 4 times. In regard to the two online algorithms, LScanAlign was faster than ScanAlign by up to factor 45. In summary, LGSlinkAlign is the best performing algorithm when searching with diverse thresholds, whereas LScanAlign is a very fast and spaceefficient alternative. RaligNAtor also allows to use the powerful concept of RNA secondary descriptors [23], i.e. searching for multiple ordered sequence-structure patterns each describing a substructure of a larger RNA molecule. For this, RaligNAtor integrates fast global and local chaining algorithms. We further performed experiments using RaligNAtor to search for members of RNA families based on information from the consensus secondary structure. In these experiments, RaligNAtor showed a high degree of sensitivity and specificity. Compared to searching with primary sequence only, the use of secondary structure information considerably improved the search sensitivity and specificity, in particular for families with a characteristic secondary structure but low degree of sequence conservation. We remark that, up to now, RaligNAtor uses a relatively simple scoring scheme. By incorporating more fine grained scoring schemes like RIBOSUM [13] or energy based scoring [42], we believe that the performance of RaligNAtor for RNA homology search can be further improved.
Beyond the algorithmic contributions, we provide with the RaligNAtor software distribution, a robust, welldocumented, and easy-to-use software package implementing the ideas and algorithms presented in this manuscript.