Structator: fast indexbased search for RNA sequencestructure patterns
 Fernando Meyer^{1},
 Stefan Kurtz^{1},
 Rolf Backofen^{2},
 Sebastian Will^{2, 3}Email author and
 Michael Beckstette^{1}Email author
https://doi.org/10.1186/1471210512214
© Meyer et al; licensee BioMed Central Ltd. 2011
Received: 10 December 2010
Accepted: 27 May 2011
Published: 27 May 2011
Abstract
Background
The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequencestructure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs.
Results
We present a novel method and readily applicable software for time efficient matching of RNA sequencestructure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stemloops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequencestructure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods.
Conclusions
The presented method's sublinear expected running time makes it well suited for RNA sequencestructure pattern matching in large sequence databases. RNA molecules containing several stemloop substructures can be described by multiple sequencestructure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust opensource software solution for indexbased search of RNA sequencestructure patterns. The Structator software is available at http://www.zbh.unihamburg.de/Structator.
Background
The discovery of new roles of noncoding RNAs (ncRNAs) has made them of central research interest in molecular biology [1, 2]. Like proteins, ncRNA sequences that have evolved from a common ancestor can be grouped into families. For instance, the Rfam database [3, 4] release 10.0 compiles 1,446 such families. Members of a family share, to different degrees, sequence and structure similarity. In many cases, however, the members of a family share only few sequence features, but share by far more specific structural and functional properties. Prominent examples of such cases are tRNAs and microRNA precursors.
In this paper, we consider the problem of searching nucleotide databases for occurrences of RNA family members. As sequence similarity is often remote even within wellestablished RNA families, we cannot rely on pure sequence alignment and related techniques for this task. Indeed, it has been shown that sequence alignments of structured RNAs fail at pairwise sequence identities below about 60% [5]. Therefore, we briefly review nucleotide database search methods that make use of sequence and structure information. There are general sequencestructure alignment tools, which determine structural similarities and derive consensus structure patterns for RNAs that are too diverse to be alignable at sequence level. We identify two classes of such tools. The first class, with RNAforrester[6] and MARNA[7] being the main representatives, require a known or predicted secondary structure for both sequences as input. However, they suffer from the low quality of secondary structure prediction, especially if the boundary of the RNA elements are not exactly known. The second class of methods are derivatives of the Sankoff algorithm [8], which provides a general solution to the problem of simultaneously computing an alignment and the common secondary structure of the two aligned sequences. Due to its high complexity ( time and memory) several variants of this approach have been introduced such as foldalign[9, 10], dynalign[11] and LocaRNA[12]. Still, these tools have a time complexity that is generally too high for a rapid database search. Thus, more specialized tools for searching RNA families in nucleotide databases have been introduced. Tools like RNAMotif[13], RNAMOT[14], RNABOB[15], RNAMST[16], PatScan[17], and PatSearch[18] are based on motif descriptors defining primary and secondary structure properties of the families to be searched for. They provide a language for defining descriptors and a method to search with these in large nucleotide databases. For these tools, the motif descriptor for a family has to be extracted externally from other information (such as a multiple sequencestructure alignment) about the specific RNA family. There are also tools that automatically derive descriptors from structureannotated sequences or a multiple sequence alignment of related RNA sequences such as Infernal[19, 20], RSEARCH[21], and PHMMTS[22]. They use variants of stochastic contextfree grammars as descriptors, whereas ERPIN[23] uses sequential and structural profiles. Despite being fast compared to other methods, descriptorbased tools available today have a running time that is, in the best case, linear in the size of the target sequence database. This makes their application challenging when it comes to large sequence databases. A solution with sublinear running time would require index data structures. However, widely used index structures like suffix trees [24] or arrays [25] or the FMindex [26] perform badly on typical RNA sequencestructure patterns, because they cannot take advantage of the RNA structure information. Here, we present a fast descriptorbased method and software for RNA sequencestructure pattern matching. The method consists of initially building an affix array [27], i.e. an index data structure of the target database. Affix arrays cope well with structural pattern constraints by allowing for an efficient matching order of the bases constituting the pattern. Structurally symmetric patterns like stemloops can be matched inside out, such that first the loop region is matched and, in subsequent extensions, pairing positions on the boundaries are matched consecutively. Because the matched substring is extended to the left and to the right, this pattern matching scheme is known as bidirectional search. Unlike traditional lefttoright search where the two substrings constituting the stem region of the pattern are matched sequentially, in bidirectional search, base complementarity constraints are checked as early as possible. This leads to a significant reduction of the search space that has to be explored and in turn to a reduced running time. We note that bidirectional search for RNA sequencestructure patterns was also presented by Mauri et al. in [28]. However, their method uses affix trees [29] instead of the more memory efficient affix arrays. Affix trees require with approximately 45 bytes per input symbol more than twice the memory of affix arrays (18 bytes per input symbol), making their application infeasible on a large scale. Moreover, their method traverses the affix tree in a breadthfirst manner, leading to a space requirement that grows exponentially with increasing reading depth. We instead employ a depthfirst search algorithm whose space requirement is only proportional to the length of the searched substring.
The affix array directly supports the search for sequencestructure patterns that describe sequencestructure motifs with nonbranching structure, for example stemloops. In contrast, e.g. the search for stems closing a multiloop is not directly supported. Nevertheless, even for RNA containing multiloops, the affix array can still speed up the search. Our general approach for finding RNA families with branching structure is to describe each stemloop substructure by a sequencestructure pattern. Each of these patterns is matched independently using the affix array. Then, with a new efficient chaining algorithm, we compute chains of matches such that the chained matches reflect the order of occurrence of the respective patterns in the molecule. Note that complex structures containing one or more multiloops can be expected to contain sufficiently many nonbranching patterns, such that the proposed chaining strategy identifies true matches with high specificity.
For a better understanding of the concepts underlying our method, we begin with formalizing RNA structural motifs. We then describe the concepts and ideas of affix arrays and show how to use them in an algorithm for fast bidirectional search for sequencestructure patterns. After presenting a detailed complexity analysis of the algorithm, we proceed with a detailed description and analysis of a novel method for computing chains of sequencestructure pattern matches. Finally, we benchmark and validate our method in several experiments.
Methods
Preliminaries
A sequence S of length n = S over an alphabet is a juxtaposition of n elements (characters) from the set . S[i], 0 ≤ i < n denotes the character of S at position i. Let ε denote the empty sequence, the only sequence of length 0. By we denote the set of sequences of length n ≥ 0 over . The set of all possible sequences over including the empty sequence ε is denoted by .
For a sequence S = S[0]S[1] ... S[n  1] and 0 ≤ i ≤ j < n, S[i..j] denotes the substring S[i]S[i + 1] ... S[j] of S. We denote the reverse sequence of S with S^{1} = S[n  1]S[n  2] ... S[0]. For S = uv, u and , u is a prefix of S, and v is a suffix of S. The kth suffix of S starts at position k, while the kth prefix of S ends at k. Note that the 0th suffix of S is S itself and that S[0] is the 0th prefix of S. The kth reverse prefix of S is the kth suffix of S^{1}. For 0 ≤ k < n, S_{ k } denotes the kth suffix of S, and , denotes the kth reverse prefix of S.
Let denote the RNA alphabet {A, C, G, U}. Its characters code for the nucleotides adenine (A), cytosine (C), guanine (G), and uracil (U). In the following we fix a sequence S over the RNA alphabet . For stating the space requirements of our index structures, we assume that S< 2^{32}, such that sequence positions and lengths can be stored in 4 bytes.
RNA structural motifs
A structure string H is a sequence over the alphabet {., (,) } with an equal number of characters (and ). There is a bijection between the set of (noncrossing) RNA structures R and the set of structure strings H, both of length m, such that for each base pair (i, j) ∈ R, H[i] = (and H[j] = ), and H[r] = . for positions r, 0 ≤ r < m, that do not occur in any base pair of R, i.e. r ≠ i ∧ r ≠ j for all (i, j) ∈ R. Due to this equivalence we identify both representations.
Let Φ = {R, Y, M, K, W, S, B, D, H, V, N} be a set of characters. The IUPAC nucleotide base code introduces the characters in Φ to code nucleotide ambiguity and assigns a specific character class to each . In particular, for and . A sequence pattern is a sequence . Let m denote its length P. An occurrence of P in a sequence S is a position i, 0 ≤ i < n, such that P[k] = S[i + k] with S[i + k] ∈ φ(P[k]) for all 0 ≤ k < m. An RNA sequencestructure pattern (RSSP) of length m is a pair of a sequence pattern P and a structure string R, both of length m. A match or occurrence of of length m in an RNA sequence S is an occurrence i of P in S, such that for all base pairs (l, r) ∈ R: S[i + l] and S[i + r] are complementary. Furthermore, define as a mapping of a character to the set of its complementary characters in , i.e. .
In this paper, structures described by RSSPs are nonbranching.
The affix array data structure
In [27] the theoretical concept of an index data structure called affix array is described. This index structure supports efficient unidirectional as well as bidirectional searches and is more space efficient than the affix tree [29, 30]. The term unidirectional search refers to the search for occurrences of a sequence pattern where the pattern characters are compared with sequence characters in a lefttoright (righttoleft) order, i.e. the already compared (matched) prefix (suffix), of the pattern is extended to the right (left). Notably, a change of the direction is not possible.
Until now, affix arrays have received little attention in bioinformatics. Presumably, this has been due to the lack of an open and robust implementation. As a consequence, their potential for efficient database search with RSSPs has hardly been recognized and the details of this data structure are not widely known in the field. Therefore, we briefly recall the basic ideas of the affix array, which constitutes the central component of our Structator approach.
 1.
lcp _{ X } [i] < ℓ;
 2.
lcp _{ X } [j + 1] < ℓ; and
 3.
lcp _{ X } [k] ≥ ℓ for all k ∈ {i + 1,..., j}.
We call a suffixinterval ℓ  [i..j] in suf_{ X }lcpinterval in suf _{ X } with lcpvalue ℓ ∈ {0,..., n}, or ℓinterval, if and only if i < j and lcp _{ X } [k] = ℓ for at least one k ∈ {i + 1,..., j}.
For a suffixinterval ℓ  [i..j] in suf _{ X } , we denote the common prefix of length ℓ of its suffixes by δ_{ X } (ℓ  [i..j]) = S^{ X } [suf _{ X } [i]..suf _{ X } [i] + ℓ  1]. In case of an lcpinterval ℓ  [i..j] in suf _{ X } , δ_{ X } (ℓ  [i..j]) is the longest common prefix of all suffixes in this interval.
In summary, a suffixinterval ℓ  [i..j] in suf _{ X } describes simultaneously:

A location in the index structure suf_{ X }by interval borders i and j and depth ℓ. For an example, see the yellow marked region in Figure 3 which corresponds to the suffixinterval 4  [4..6] in suf_{F}.

A (lexicographically ordered) sequence of suffixes . For an example, consider the lexicographically ordered sequence of suffixes in the suffixinterval 4  [4..6] in suf_{F} in Figure 3.

A substring of S^{ X }of length ℓ, namely δ_{ X }(ℓ  [i..j]). That is, for the suffixinterval 4  [4..6] in suf_{F} in Figure 3, δ_{F}(4  [4..6]) = CUGC.

The occurrences of this substring in S^{ X }, namely at positions suf_{ X }[i],..., suf_{ X }[j]. To give an example, consider Figure 3 and observe that substring CUGC occurs at positions suf_{F}[4] = 10, suf_{F}[5] = 7, and suf_{F}[6] = 4 in S^{F} = AUAGCUGCUGCUGCA.
For unidirectional lefttoright search of some pattern in S it is sufficient to process lcpintervals only in suf_{F}. For bidirectional pattern search using affix arrays, described in detail in the next section, we employ information from table suf_{F} as well as suf_{R}. Therefore, we need to associate information of one table to the other. This is done by linking intervals via tables aflk_{F} and aflk_{R}. We observe that there exists a mapping between lcpintervals in suf_{F} and suf_{R}. This is stated by the following proven lemma [27].
Lemma 1 For every lcpinterval q = ℓ  [i..j] in table suf_{ X }there is exactly one lcpinterval q^{1} = ℓ'  [i'..j'] in table called reverse lcpinterval of q, such that ℓ' ≥ ℓ and the ℓ  1th prefix of equals (δ_{ X }(q))^{1}. The number of suffixes (prefixes) represented by q and q^{1}are the same, i.e., j  i = j'  i'.
We note that the equivalence q = (q^{1})^{1} is not necessarily true. This is stated by the next lemma.
Lemma 2 If the lcpinterval q^{1}with depth ℓ' in is the reverse of the lcpinterval q with depth ℓ in suf _{ X } and ℓ = ℓ', then q = (q^{1})^{1}. Otherwise, if ℓ' > ℓ, then q ≠ (q^{1})^{1}.
where ℓ  [i..j] is an lcpinterval in suf _{ X } . Hence, the home position is one of two boundary positions. Strothmann [27] shows that home _{ X } ([i..j]) ≠ home _{ X } ([i'..j']) for different lcpintervals ℓ  [i..j] and ℓ'  [i'..j'].
Table aflk _{ X } of string S^{ X } $ with total length n + 1 can now be defined as a table in the range 0 to n such that aflk _{ X } [home _{ X } (q)] = i', where q is an lcpinterval in suf _{ X } and i' is the left border of the reverse interval q^{1} = [i'..j'] in . We refer to the entries in table aflk _{ X } as affix links. Tables aflk_{F} and aflk_{R} occupy 4n bytes each. They can be computed by traversing the lcpintervals in suf _{ X } while simultaneously looking for the corresponding reverse lcpintervals in . Locating reverse lcpintervals can be accelerated by skptables. These tables, introduced in Beckstette et al.[37] and hereinafter referred to as skp_{F} and skp_{R}, can be constructed in linear time [38] and allow one to quickly skip intervals in suf _{ X } (for details, see [37]).
The construction of tables aflk_{F} and aflk_{R} takes time. Although the use of skptables requires additional 2 × 4n bytes of memory, they considerably reduce the construction times of tables aflk_{R} and aflk_{R} in practice. We note that Strothmann [27] describes a linear time construction algorithm for tables aflk_{F} and aflk_{R}, which employs suffix link and childtables [34] and an additional table. Altogether these tables require together at least additional 7n bytes of space. Moreover, even without applying the skptable based acceleration, Strothmann states that the quadratic time construction algorithm is fast in practice. An example of the affix array for sequence S = AUAGCUGCUGCUGCA highlighted with some of its lcpintervals connected to the respective reverse interval via the aflk _{ X } table is shown in Figure 3.
Because affix links in table aflk _{ X } are only defined for lcpintervals but not suffixintervals in general, which we require in bidirectional search, we introduce the concept of affixintervals. Affixintervals are similar to affix nodes as defined in [27]. An affixinterval in suf _{ X } is a triple v = 〈k, q, X〉, where k is an integer designated context of v and q is a suffixinterval in suf _{ X } .
An affixinterval v = 〈k, q, X〉 in suf _{ X } , with q = ℓ  [i..j], ℓ > 0, m < k < ℓ, describes a substring ω_{ X } (v) of S^{ X } of length ℓ  k, defined as the kth suffix of δ_{ X } (q), i.e. ω_{ X } (v) = S^{ X } [suf _{ X } [i] + k..suf _{ X } [i] + ℓ  1]. At the same time v identifies all occurrences of ω_{ X } (v) in S^{ X } , namely the positions suf _{ X } [i] + k,..., suf _{ X } [j] + k.
For v = 〈k, q, X〉, we therefore also use the notation if X = F and if X = R. As an example, consider the affixinterval v = 〈1, 4  [4..6], F〉 in suf_{F} of the affix array shown in Figure 3. In this case, k = 1, q = 4  [4..6], and X = F. v identifies all occurrences of substring in S^{F} at positions suf_{F}[4] + 1 = 11, suf_{F}[5] + 1 = 8, and suf_{F}[6] + 1 = 5. Observe that is the first suffix of δ_{F}(q) = CUGC due to context k = 1.
Searching RNA databases for RSSPs with affix arrays
Pattern matching using affix arrays means the sequential processing of characters in the pattern guiding the traversal of the data structure. This can be performed in either a traditional lefttoright order resulting in a unidirectional search or in a bidirectional way where character comparison is started at any position of the pattern extending the already matched substring of the pattern to the left or to the right. We will see that bidirectional search using alternating series of left and right extensions is very well suited for fast database search with RNA sequencestructure patterns (RSSPs) containing both paired and unpaired bases. In the following we will explain the two different traversal strategies underlying unidirectional and bidirectional search using affix arrays.
Unidirectional traversal
After m steps, if all q_{ k } could be located, δ_{F}(q_{ m } ), q_{ m } = m  [r..s], matches the pattern P and the occurrences suf_{F}[r], suf_{F}[r + 1],..., suf_{F}[s] of δ_{F}(q_{ m } ) are reported as occurrences of P in S. Note that in this approach the matched substring of S is extended only to the right and at each step k the occurrences of the already matched prefix are represented by a suffixinterval.
Bidirectional traversal
For the bidirectional search, we start at some position in and then compare the pattern P character by character to the text, where we can freely switch between extending to the left or to the right. Note that as in the case of unidirectional search, ambiguous nucleotides x in the pattern can be handled by enumerating all characters c in the corresponding character class φ(x). We can focus on the situation in the search, where

a range r..r' (0 ≤ r ≤ r' < m) of the pattern P is already compared,

the occurrences of a substring of S matching P[r..r'] are represented by an affixinterval v = 〈k, ℓ  [i..j], X〉 in suf_{ X }, and

we want to extend either to the left or to the right by a sequence character (that matches the respective pattern character P[r  1] or P[r' + 1]). This will result in a new, extended affixinterval v_{ x }.
Switch of the search direction
Like its suffixinterval, an affixinterval directly supports extension of the represented substring in only one direction, namely searching to the left for X = F and to the right for X = R. However, there are "corresponding" affixintervals representing the same substring of S but allowing extension to the opposite direction.
If the new search direction differs from the supported search direction of v, this switch of the search direction requires determining the corresponding affixinterval v' in unless i = j or v has nonempty context k ≠ 0. There are these two exceptions, since first if i = j, independently of the value of k, ω_{ X } (v) is already a unique substring of S^{ X } . Second, for a nonempty context k ≠ 0, all occurrences of substring ω_{ X } (v) in S^{ X }are followed (if k > 0) or preceded (if k < 0) by the same substring .
Let k = 0 and i < j. The affixinterval in is called the reverse affixinterval of v = 〈k, ℓ  [i..j], X〉 if and only if j'  i' = j  i, ℓ' ≥ ℓ, and . The interval boundaries i' and j' of v' are determined via a lookup in table aflk _{ X } . We set i' = aflk _{ X } [home _{ X } ([i..j])] and j' = i' + (j  i). Observe that ℓ is not necessarily the length of the longest common prefix of all suffixes in [i..j]. For this reason we define ℓ_{lcp} = min{lcp _{ X } [k]  i < k ≤ j} ≥ ℓ and compute the context of v' as k' = ℓ_{lcp}  ℓ. Further, we set ℓ' = ℓ_{lcp}. Hence the reverse affixinterval is well defined and v' is the required corresponding interval of v.
Right/left cextension of an affixinterval
In our situation, represents the occurrences of a substring u of S matching P[r..r'].
The right (left) extension of v by a character , also called cextension of v, is an operation that computes the affixinterval v_{ x } representing all occurrences of a substring uc (cu). It fails, if there is no such substring. We elaborate the cases for right extension. The cases for left extension are symmetric and therefore omitted. For right cextension of v = 〈k, ℓ  [i..j], X〉, we determine the interval v_{ x } = 〈k_{ x } , ℓ_{ x }  [i_{ x } ..j_{ x } ], X_{ x } 〉 with . The first two cases do not require switching the search direction.

Case X = F and i = j. u is a unique substring of S. If S[suf_{F}[i] + ℓ] = c, then v_{ x }= 〈k, (ℓ + 1)  [i..j], F〉.

Case X = F and i < j. We determine the minimal i_{ x }≥ i and maximal j_{ x }≤ j in suf_{F} such that S[suf_{F}[i_{ x }] + ℓ] = c and S[suf_{F}[j_{ x }] + ℓ] = c by binary search in the suffixinterval ℓ  [i..j]. If i_{ x }and j_{ x }exist, we set v_{ x }= 〈k, (ℓ + 1)  [i_{ x }..j_{ x }], F〉.
The following cases require switching the search direction.

Case X = R, i = j. We evaluate S^{R}[suf_{R}[i] + k  1]. If S^{R}[suf_{R}[i] + k  1] = c, set v_{ x }= 〈k  1, ℓ  [i..j], R〉.

Case X = R, i < j, and k = 0. We first determine the reverse affixinterval v' = 〈k', ℓ'  [i'..j'], F〉 of v via a switch of the search direction as described above. Then we compute the minimal i_{ x }≥ i' and maximal j_{ x }≤ j' via binary search, such that S[suf_{F}[i_{ x }] + ℓ'] = c and S[suf_{F}[j_{ x }] + ℓ'] = c. If i_{ x }and j_{ x }exist, we set v_{ x }= 〈k', (ℓ' + 1)  [i_{ x }..j_{ x }], F〉.

Case X = R, i < j, and k > 0. We evaluate the (k  1)th character of δ_{R}(ℓ  [i..j]). That is, if δ_{R}(ℓ  [i..j])[k  1] = c, then we consume the context k by setting v_{ x }= 〈k  1, ℓ  [i..j], R〉.
The operation fails if v_{ x } cannot be determined.
RSSP matching using affix arrays
Searching a sequence S with an RNA sequencestructure pattern (RSSP) means to find the occurrences of P in S under the complementarity constraints imposed by the structure string R (cf. our definition of RSSPoccurrence). We introduce a search algorithm that checks for complementarity constraints as early as possible in bidirectional search to maximally reduce the search time due to this restriction.
Furthermore, a general pattern is called inconsistent if it does not have any instance. Formally, a pattern is consistent if and only if for each base pair (i, j) it holds that and . An example of an inconsistent RSSP is with P = UAUACACGAN and R = ( ( . . . . . . ) ). is not consistent because there is a base pair (1, 8) ∈ R but the bases P[1] = A and P[8] = A are not complementary. An example of a structure minimal and consistent RSSP is (UNUACACGNR, ( ( . . . . . . ) ) ). Note that a pattern can be transformed into an equivalent structure minimal pattern and checked for consistency in time. For complexity considerations, we can therefore safely assume that patterns are consistent and structure minimal.
In this case, one can restrict the search space by comparing the two positions of each base pair immediately after each other. Due to this, the enumeration of characters matching the pattern symbols at each base pair can be restricted to the smaller number of complementary ones. In the search for a sequencestructure pattern this can reduce the number of enumerated combinations of matching characters exponentially. Thus, for structure minimal patterns (P, R), the nonbranching structure R suggests a search strategy, i.e. an order of left and right extensions, which requires switching the search direction at every base pair but makes optimal use of the complementarity constraints due to the base pairs.
In our algorithm, we switch the search direction only once per base pair when matching the stem region of the pattern, thus halving the number of lookups in the affix link tables compared to a naive algorithm without this optimization. This was also observed by Strothmann [27] whose algorithm did not support RSSPs containing bulges and internal loops.
To match we call procedure bidirsearch initially as bidirsearch(〈0, 0  [0..n], F〉, r_{0}  1, r_{0}), where 〈0, 0  [0..n], F〉 is an affixinterval and r_{0} is any position in the loop region of the RSSP or any position of a completely unpaired pattern. Then, the procedure traverses the affixintervals by performing right and left extensions, while at the same time checking base complementarity of paired positions. This verification takes constant time by using a binary table of size containing all valid base pairings. Matching positions are reported whenever the boundaries of the RSSP are reached.
In principle, we are free to choose any loop position r_{0} (or any position if R is empty) for starting our bidirectional search algorithm. However, in order to reduce the combinatorial explosion of the search space due to ambiguous IUPAC characters, it is preferable to match nonambiguous pattern characters first. To keep the selection simple, we set r_{0} to the position of the first character c in the possible range such that φ(c) is minimal. That is, we start the search with the most specific (least ambiguous) character.
A detailed example of bidirectional RSSP search along with the underlying affix array traversal is provided in Additional file 1 Section S1. We remark that procedure bidirsearch can be extended to support variablelength RSSPs. Such an extended version of bidirsearch is provided in Additional file 1 Section S3.
Analysis
We analyze the complexity for searching in a sequence S of length n for an RSSP of length m < n, where the index structures for S are already computed.
The bidirectional search algorithm requires tables suf_{F} and suf_{R}, lcp_{F} and lcp_{R}, and aflk_{F} and aflk_{R}. Under our assumption that n < 2^{32}, each of the four tables suf _{ X } and aflk _{ X } consumes 4n bytes, and the two tables lcp_{X} are each stored in n bytes (X ∈ {F, R}). This amounts to a space consumption of 18n bytes for the index structures. The algorithm performs a depth first search, where the depth is limited by m, and therefore requires space. The total space complexity is therefore .
We assume that is structure minimal. Such a pattern without ambiguity, i.e. , does not contain base pairs and the search for does not profit from bidirectional search. Although such a pattern is processed by Algorithm 2, it can be handled by Algorithm 1 using only a suffix array and saving some overhead.
Algorithm 1 accomplishes the search for a nonambiguous pattern on the suffix array suf_{F} using binary search for locating intervals in time, where z is the number of occurrences of P in S. We remark that this time bound can be lowered at the price of higher memory consumption to [25] or even [34, 39] time by using additional precomputed information.
Notably, if there is ambiguity but no base pair in , bidirectional search can still be beneficial in practice. This is the case when searching for a pattern in which a string of unambiguous characters is surrounded on both sides by ambiguous IUPAC characters, because the comparison can start at the most specific part of the pattern. The time complexities for searching ambiguous patterns with Algorithm 1 can be estimated as in the worst case of searching for the sequence pattern P consisting only of Ns. Furthermore, note that our Algorithm 2 behaves exactly like Algorithm 1 on patterns without base pairs if we invoke the search procedure with r = 1 and r' = 0.
For a pattern of length m, let p ≥ 0 be the number of base pairs in R. In the worst case P consists only of Ns. Moreover, all possible strings of length m satisfying the complementarity constraints specified in R occur in the text S. Recall that, since we allow (G, U) pairs, there are possible complementary base pairs. Thus, there are such strings and Algorithm 2 spans a virtual tree with paths from the root to a leaf. At each leaf, it reports the occurrences of the respective matched substring.
The cost of each cextension consists of the cost of locating the suffixinterval of the new affixinterval, which is performed by binary search in , and the cost for potentially computing the reverse affixinterval when switching the search direction.
Instead of performing the binary search over the suffix tables, one can use the childtables introduced by Abouelhoda et al. in [34] to determine the child intervals and switch the search direction in constant time. The childtables, however, add at least 2n bytes to the index and require additional involved index construction. As the childtables improve the worst case behavior but, on the other hand, require more space, we analyze the complexity with and without these tables (i.e. with tables suf_{ X }, lcp_{ X }, and aflk _{ X } only).
 (1)
Case i = j or k ≠ 0. If i = j, represents a unique substring of S, or, if k ≠ 0, all occurrences of substring in S are followed (if k > 0) or preceded (if k < 0) by the same substring of length k (known as context). Switching the search direction does not require locating the reverse interval of v, because the algorithm can perform the cextension in the new search direction by consuming context. Therefore, this case requires constant time.
 (2)
Case i < j and k = 0. The algorithm needs to locate the reverse affixinterval of v. Interval boundaries i' = aflk _{ X } [home _{ X } ([i..j])] and j' = i' + (j  i) of v' are computed in constant time.
By definition, computing the reverse affixinterval of v requires knowing ℓ_{lcp}. Then, ℓ' = ℓ_{lcp} and k' = ℓ'  ℓ. Without childtables, we determine ℓ_{lcp} by computing the length of the longest common prefix between and . It suffices to perform ℓ_{lcp}  ℓ + 1 = k' + 1 character comparisons only, since both suffixes and share a common prefix of at least length ℓ. With the help of childtables, ℓ_{lcp} is determined in constant time [34].
Due to the following lemma, the computation of all reverse affixintervals on one path of our virtual tree is in if childtables are not used.
Lemma 3 Using tables suf_{ X }, lcp _{ X }, and aflk _{ X }, the computation of all contexts on a path in the recursion of Algorithm 2 is in .
Proof. Let v_{1}, v_{2}, v_{ t }..., v_{ C }be the sequence of reverse intervals processed when matching , and let k_{ t } denote the context of v_{ t } for 1 ≤ t ≤ C.
To show , let v = 〈k, ℓ  [i..j], X〉, with k = 0, i < j, and X = F (X = R), be the current affixinterval. We assume without loss of generality that we perform a left (right) cextension of v and thus locate the reverse interval . Then the following statements hold: k_{ t } ≥ 0, ℓ_{ t } = ℓ + k_{ t } , and j_{ t }  i_{ t } = j  i (see Lemma 1). Observe that k_{ t } = 0 implies and k_{ t }> 0 implies that substring has a nonempty prefix of length k_{ t } , namely .
Note that v_{ t } is only located if k = 0, otherwise the context k has to be consumed. Hence there is no reverse interval , with 1 ≤ s ≤ C, s ≠ t, and k_{ s }> 0, such that the (k_{ s }  1)th prefix of overlaps with for the same positions in . From this, follows. Since a single context k_{ t } can be determined by performing exactly k_{ t } + 1 character comparisons, this implies time to compute all these contexts. With this, we conclude that all switches of the search direction performed while finding one substring w in S that matches take up to time. □
Therefore, when searching for without childtables, the total time for switching search directions is coarsely estimated by multiplying the complexity for one path with the number of paths as . The use of childtables removes the linear factor.
For the worst case that all strings matching the pattern actually occur as substrings in S, the sequence S must have a certain minimal length. In the case of p = 0, the possible matches are the words in and a sequence that contains all these matches is called ary de Bruijn sequence of order m[40] without wraparound, i.e. a de Bruijn sequence with its first m  1 characters concatenated to its end. Such a sequence was shown to have a length of . As a consequence, the worst case requires n ≥ n_{0}.
We summarize the worstcase time complexities for Algorithm 2 as follows. 1.) From determining new suffixintervals, we get a contribution of . For n ≥ n_{0}, this is in . Childtables reduce this time further to . 2.) Switching directions without childtables is in worstcase time, which is reduced to when using childtables. For n ≥ n_{0}, E_{ m, p } is in . Finally, Algorithm 2 runs in , which is reduced to using childtables (i.e. for n ≥ n_{0}).
One should note that the worstcase time complexity of bidirectional search for sequencestructure pattern is only in the order of online search algorithms. In our implementation, we use a minimal set of tables in order to keep the implementation simple and save space.
However, it can be clearly seen from this analysis that the worst case is based on extremely pessimistic assumptions that are almost contrary to the expected application. 1.) It is assumed that a pattern consists of wildcards N only. In the expected application, however, patterns will often specify bases in the loop region, which is of particular benefit for our algorithm. 2.) Sequences, like the de Bruijn sequence, that contain all possible matches of an average sized pattern will be rare in practice. E.g. it could be assumed that a sequence that contains all possible matches of a pattern Q with p base pairs (and P = N ... N) is at least as long as the ary de Bruijn sequence of order m, since one expects no significant bias for the specific complementarity due to R over all substrings of length m. However, is even for small p much smaller than n_{0} = 4 ^{ m } + m  1. For example, four base pairs (i.e., p = 4) reduce the time bound by a factor of (16/ 6)^{4} ≈ 50 and eight base pairs reduce time by a factor of about 2500.
RNA secondary structure descriptors based on multiple ordered RSSPs
Obviously RNAs with complex, branching structures cannot be described completely by a single RSSP. Describing an RNA by only a single unbranched fragment is often inappropriate, since searching a large sequence database or a complete genome for structurally conserved RNAs (RNA homology search) with a single RSSP will likely generate many spurious matches. However, larger RNAs can often adequately be described by a sequence of RSSPs. This holds for 1,247 out of 1,446 RNA families in Rfam 10.0 which have a structure containing several stemloops but no multiloop. Only 199 out of 1,446 (13.76%) RNA families in Rfam 10.0 containing multiloops cannot be modeled completely this way. Still, the consensus structures of these 199 families contain on average 4.06 stemloops (standard deviation 2.08, median 3) which can be modeled as RSSPs. In consequence, we can use a sequence of RSSPs that consist of at least one pattern per stemloop (and potentially also unstructured patterns) for the description of those families. This allows to accurately identify members even of those families containing multiloops.
An SSD of length L is a sequence of L RSSPs where denotes the RSSP describing A_{ i } , i ∈ [1, L]. The order ≪ of the RSSPs in is imposed by the order of the corresponding alignment blocks. By l_{ i } and r_{ i } we denote the start and end positions of A_{ i } in the multiple alignment, respectively. In practice, can be obtained from multiple sequencestructure alignments of related RNA sequences (i.e., of an RNA family) as they are available in databases like Rfam [3, 4]. A match to is a nonoverlapping sequence of matches for some or all of the RSSPs in in their specified order. We will now make this more precise.
all from , such that for all i, 1 ≤ i ≤ k  1.
To solve the local chaining problem we use our own implementation of a fast local chaining algorithm described in [50] with modified gap costs. While the algorithm of [50] penalizes gaps by the sum of their lengths, our solution is based on the difference between their observed lengths (in the chain of matches) and their expected lengths (as given by the multiple alignment of the family), confer Equation 4. This algorithm runs in O(q log q) time where q is the size of .
To solve the global chaining problem we have developed a new efficient chaining algorithm described next.
An improved method for global RSSP match chaining
So far our description was based on a single sequence. However, the results described below are based on a large set of sequences S_{1},..., S_{ k } as it occurs when searching a large sequence database. I.e. in case of databases like Rfam k can be in the range of millions. To handle these, we concatenate the single sequences with separator symbols and construct the affix array for the concatenation. For a given SSD , all RSSPs , 1 ≤ i ≤ L, are matched one after the other using fast bidirectional search on the affix array. This results in match sets for RSSP . L is typically in the range of tens while the number of RSSP matches for a particular sequence S_{ j } is in the order of hundreds to thousands if S_{ j } is an mRNA or complete genome sequence. For each match f the following information is recorded:
In an initial sorting step the union of all match sets , 1 ≤ i ≤ L, is sorted in ascending order of f.seqnum. Matches with identical sequence numbers are sorted in ascending order of the ordinal number of the RSSP, i.e., by f.rssp. Suppose that b* is the size of . As there are at most b* sequences with at least one RSSP match, the sorting according to the sequence numbers can be done in time and space using the counting sort algorithm [51]. Here, k* is the number of sequences with at least one RSSP match. As k* ≤ b*, the sorting requires time and space. We obtain disjoint subsets , 1 ≤ j ≤ k, where is the set of all matches in matching a substring of S_{ j } . As is ordered by the ordinal number of the RSSP and the counting sort algorithm is stable, the sets are also sorted by the ordinal number of the RSSPs. Let denote the matches such that f.rssp = i. In a second sorting step, each is sorted according to the starting position of the matches. As this is a typical integer sorting problem, it requires time, where b_{ j, i } is the size of . Altogether, the two initial sorting steps can be performed in time.
For all S_{1}, S_{2},..., S_{ k } one now solves independent chaining problems for sets , 1 ≤ j ≤ k, of matches sorted according to the ordinal number of the RSSP and the starting position of the matches in S_{ j } . Let j be fixed, but arbitrary. For each match , the weight f.weight is positive. Hence, an optimal chain ends with a match f such that there is no match f' satisfying f ≪ f'. Similarly, an optimal chain begins with a match f' such that there is no match f satisfying f ≪ f'.
The chaining problem is solved by a dynamic programming algorithm which tabulates for all matches the maximum score f'.score of all chains ending with f'. In addition, it computes the predecessor f'.prec of f' in a chain with maximum score ending with f'. To obtain f'.score, one has to maximize over all matches f such that f.rssp < f'.rssp and f.pos + f.length  1 < f'.pos. This is a two dimensional search problem. As the matches in are already sorted according to the first dimension (i.e., by the ordinal number of the RSSP), one can reduce it to a one dimensional sorting problem. This has already been observed in [50], and led to the development of an algorithm solving the chaining problem in , where b is the number of matches in . However, the algorithm of [50] was developed for chaining pairwise sequence matches. The RSSP chaining problem is a special instance of this problem: the first "sequence" consists of the positions 1,..., L, and a match for RSSP is a match of length one to position i. Moreover, matches at position i in the first sequence can be treated as being of equal length because they are matches to the same RSSP . In addition to this, our initial sorting step delivers, for all i, 1 ≤ i ≤ L, the matches in in sorted order according to the starting position in S_{ j } . All these properties allow us to simplify and improve the algorithm of [50] in the following aspects:

While the algorithm of [50] requires a dictionary data structure with insert, delete, predecessor, and successor operations running in logarithmic time (e.g., an AVLtree or a redblack tree [51]), our approach only needs a linear list, which is much easier to implement and requires less space.

While the algorithm of [50] requires an initial sorting step using time, our method only needs time for this step. Note that the b_{ j, i }satisfy .

While the algorithm of [50] solves the chaining problem for in time, our approach runs in time. If L is considered to be a constant, the running time becomes linear in b, where .
To explain our algorithm, let i, 1 ≤ i ≤ L be arbitrary but fixed and assume that all match sets have been processed. In a first loop over the sorted matches in one determines the score of the matches. In a second loop, one inserts them into a linear list if necessary. The linear list contains a subset of the previously processed and scored matches. This split of the computation into two loops is different from the algorithm of [50] where the scoring and insertions are interweaved in one loop, requiring an extra array of length 2b containing references to the matches. The separation into two loops allows us to get rid of this extra array.
Now consider the first loop over all elements in in sorted order of the match position in S_{ j } . Let f' be the current element. At this point, all matches f such that f.rssp < f'.rssp have been processed already. In particular, the score f.score and the previous match (if any) in an optimal chain ending with f has been determined. Among the processed matches we only have to consider those matches f satisfying f.pos + f.length  1 < f'.pos. If there is such a match, one takes the one with maximal score, say f. Then, the optimal chain ending with f' contains the previous match f, and the score is f'.score = f'.weight + f.score. If there is no such match, then the optimal chain ending with f' only consists of f' and f'.score = f'.weight.
Now consider the second loop over all elements in for which the scores and predecessor matches (if any) are already determined. Let f' be the current element to be inserted. As explained in the previous case, one has to make sure that, among the processed matches, one can efficiently determine the match f with the maximum score such that f.pos + f.length  1 is smaller than some value depending on f'. The processed matches are stored in a linear list which is sorted in ascending order of the position of the matches in S_{ j } . Let ≺ _{ pos } denote this order, that is f ≺_{ pos }f" if and only if f.pos + f.length < f".pos + f".length for any matches f and f". If for two processed matches f and f" one has f.pos < f".pos and f.score > f".score, then an optimal chain does not include f". Each chain that uses f" can also use f and increase the chain score. As a consequence, one has to take care that f" is not inserted into the linear list or it is deleted if it was inserted earlier. In this way, f ≺_{ pos }f" always implies f.score ≤ f".score for two matches f and f" in the linear list. As the elements to be scored in the first loop and to be inserted in the second loop are ordered in the same way as the elements in the linear list, one can perform the scoring and the insertion loop (which also may involve deletions) by merging two lists of length l_{1} and l_{2} in time where l_{1} is the number of matches to be scored and inserted and l_{2} is the length of the linear list involved. Let . As l_{1} + l_{2} ≤ b, one obtains a running time of for each set . As there are L such sets, the running time is .
Results
Implementation and computational results
We implemented (1) the algorithms necessary for affix array construction, (2) the fast bidirectional search of RSSPs using affix arrays as sketched in Algorithm 2 (hereinafter called BIDsearch), (3) an online variant operating on the plain sequence (hereinafter called ONLsearch) for validation of BIDsearch and reference benchmarking, and (4) the efficient global and local chaining algorithms. Algorithm ONLsearch shifts a window of length m = RSSP along the sequence of length n to be searched and compares the substring inside the window with the RSSP from left to right until a mismatch occurs. Hence, it runs in time in the worst and time in the best case. Algorithms BIDsearch and ONLsearch were implemented in the program afsearch. The afconstruct program makes use of routines from the libdivsufsort2 library (see http://code.google.com/p/libdivsufsort/) for computing the suf_{F} and suf_{R} tables in time. For the construction of the lcp_{F} and lcp_{R} tables we employ our own implementation of the linear time algorithm of [36]. Tables aflk_{F} and aflk_{R} are constructed in worstcase time with fast practical construction time due to the use of the skip tables skp_{F} and skp_{R}[37]. The programs were compiled with the GNU C compiler (version 4.3.2, optimization option O3) and all measurements were performed on a Quad Core Xeon E5410 CPU running at 2.33 GHz, with 64 GB main memory (using only one CPU core). To minimize the influence of disk subsystem performance the reported running times are user times averaged over 10 runs. Allowed base pairs were canonical WatsonCrick (A, U), (U, A), (C, G), (G, C), and wobble (G, U), (U, G), unless stated otherwise.
Affix array construction times
We also measured the running time of afconstruct to construct the affix array for a set of 3,192,599 RNA sequences with a total length of ~ 622 MB compiled from the full alignments of all Rfam release 10.0 families. The construction and storage on disk required 126 minutes. In the following we refer to this dataset as RFAM10 for short.
Influence of loop length on search performance
Experiment 2: Obtained speedup of BIDsearch over ONLsearch for different loop length l ∈ {3,..., 20} and number of nonambiguous characters in the loop region q ∈ {0, ... , 4}
l  3  4  5  6  7  8  9  10  11 

q = 0  78.10  48.64  35.42  23.55  16.35  11.01  7.31  4.89  3.48 
q = 1  329.81  180.45  105.67  57.41  33.75  19.20  11.30  7.14  4.81 
q = 2  749.94  418.65  227.45  121.80  67.81  36.99  21.44  12.73  8.41 
q = 3  2,345.17  1,169.53  653.31  353.49  188.34  103.34  56.59  33.08  20.79 
q = 4  11,045.75  3,638.14  2,144.8  1,132.53  610.63  338.77  184.56  106.11  64.93 
l  12  13  14  15  16  17  18  19  20 
q = 0  2.67  2.15  1.79  1.51  1.37  1.20  1.13  1.07  1.00 
q = 1  3.58  3.13  2.28  1.89  1.68  1.46  1.35  1.27  1.12 
q = 2  5.96  4.88  3.64  2.94  2.57  2.19  2.02  1.82  1.63 
q = 3  14.27  11.88  8.25  6.50  5.53  4.74  4.19  3.76  3.34 
q = 4  43.09  35.23  25.74  19.52  15.91  13.25  11.75  10.32  9.28 
Searching large sequence databases
Experiment 3 (A): Running times in seconds needed by the programs to search for 397 RSSPs describing 42 RFAM10 families in ~ 622 megabases of RNA sequence data.
BIDsearch  ONLsearch  RNAMotif  RNABOB 

46.1(1)  6,203(134.5)  11,745(254.7)  9,061(196.5) 
We also investigated the distribution of speedup factors obtained by BIDsearch when searching for the 397 RSSPs. We observed that BIDsearch is more than 50,000 times faster than RNABOB and RNAMotif for the majority of the patterns and that the total search time required by BIDsearch is dominated by only a small number of patterns. These patterns describe large unconserved loop regions. See Figure S3 in Additional file 1 for a graphical visualization of the distribution of speedup factors.
Scaling behavior of bidirectional pattern search using affix arrays
In this experiment BIDsearch clearly showed a sublinear scaling behavior, whereas ONLsearch scaled only linearly. It took BIDsearch only 566.8 (pattern1), 133.8 (pattern2), and 37.1 (pattern3) milliseconds to search the whole RFAM10 dataset. The obtained speedups of BIDsearch over ONLsearch were in the range from 4.63 (1 MB subset) to 104.79 (full RFAM10) for pattern1, from 12.23 (1 MB subset) to 223.18 (full RFAM10) for pattern2, and from 35.0 (1 MB subset) to 618.37 (full RFAM10) for pattern3. We observe again that the specification of only one or two nucleotides in an RSSP's loop region considerably reduces the running time of the BIDsearch algorithm.
RNA family classification by global chaining of RSSP matches
We also employed global chaining to detect members of the structurally more complex family of Citrus tristeza virus replication signal (Rfam Acc.: RF00193). Therefore we built an SSD comprising 8 RSSPs, describing 8 of 10 stemloops the molecule is predicted to fold into. For more information on the molecule's secondary structure and the used descriptor, see Additional file 1 Figure S4. Using Structator operating in BIDsearch (ONLsearch) mode and global chaining of RSSP matches it took only 1.3 (138.7) seconds to search RFAM10 with this SSD, where 0.06 seconds were required for the chaining. The computed global chains with a minimum length of 5, computed from the 184,199 single RSSP matches, were ranked according to their global chain score. We observe that the sequences containing the 37 highest scoring chains are exactly all 37 members of the family.
In addition we measured the performance of Structator using global chaining for RNA family classification with manually compiled SSDs for 42 Rfam families. For the results of this experiment see Additional file 1 Table S4.
Searching whole genomes using local chains of RSSP matches
Comparison of implementations of bidirectional pattern search
Search time comparison between Structator' s BIDsearch and an implementation, here called BWI, of bidirectional search using the wavelet tree data structure described in [54].
hairpin1  hairpin2  hairpin4  hloop(5)  acloop(5)  acloop(10)  

BWI  10,484  64  612  26,413  896  420 
BIDsearch  8,325  32  330  16,768  511  295 
BIDsearch vs. BWI  1.26  2  1.85  1.58  1.75  1.42 
Structator software package
Structator is an opensource software package for fast database search with RNA structural patterns implementing the algorithms and ideas presented in this work. It consists of the command line programs afconstruct and afsearch.
afconstruct implements all algorithms necessary for affix array construction, namely a lightweight suffix sorting algorithm for construction of the suffix arrays suf_{F} and suf_{R}, the algorithm for construction of tables lcp_{F} and lcp_{R}[36], and the algorithm for computation of the affix link tables aflk_{F} and aflk_{R}. The program constructs all or if necessary only some of the tables of the affix array for a target database provided in FASTA format and stores them on disk. Therefore the program can also be used to compute only the tables needed for a traditional enhanced suffix array [34]. afconstruct can handle RNA as well as DNA sequences. Moreover, it supports the transformation of input sequences according to userdefined (reduced) alphabets and allows the index construction for transformed sequences. Such personalized alphabets are easily specified in a text file.
The search is performed efficiently on a precomputed affix array. afsearch implements the bidirectional indexbased search algorithms BIDsearch and the online algorithm ONLsearch operating on the plain sequence, both extended to support patterns with variable loop size and/or stem length. Further, it implements the methods for fast global and local chaining of RSSP matches. The search with RSSPs can be performed on the forward and, in case of nucleotide sequences, also on the reverse strand. Searching on the reverse strand is implemented by reversal of the RSSP and transformation according to WatsonCrick base pairing. Hence it is sufficient to build the affix array for one strand only.
RSSP matches can be reported directly by afsearch or can be used as input for the computation of highscoring global or local chains of matches. Computed chains resemble the order of the RSSPs given in the pattern file and are reported in descending order of their chain score. This allows the description of complex secondary structures with our new concept of secondary structure descriptors (SSDs). This is done by simply specifying a series of RSSPs in the pattern file describing the stemloop substructures the RNA molecule is composed of in the order of their occurrence in 5' to 3' direction. To incorporate different levels of importance or significance of an RSSP into SSD models and subsequently in the computation of chain scores, RSSP specific weights can be defined in the pattern file. This is particularly useful in the context of RNA family classification where the used SSD may be derived from a multiple sequencestructure alignment or a consensus structureannotated multiple sequence alignment. Here, it permits the assignment of higher weights to RSSPs describing highly conserved functionally important structural elements occurring in a family of RNAs, and lower weights to RSSPs describing less conserved substructures that occur only in certain members of the family.
The output format of afsearch contains all available information of a match or chain of matches, either in a humanreadable, or a tabdelimited format. Moreover, afsearch can also report matches in BED format. This allows a direct visualization of the results in e.g. the UCSC genome browser.
Discussion and conclusion
We have presented a method for fast indexbased search of RNA sequencestructure patterns (RSSPs), implemented in the Structator software. As part of the software, we give the first publicly available implementation of bidirectional pattern search using the affix array data structure. For the majority of biologically relevant RSSPs, our implementation of BIDsearch shows superior performance over previous programs. In a benchmark experiment on the Rfam database, BIDsearch was faster than RNAMotif and RNABOB by up to two orders of magnitude. Furthermore, in a comparison between BIDsearch and the program of [54], which works on compressed index data structures, BIDsearch was faster by up to 2 times. We observed that for RSSPs with long unconserved loop regions, the advantage of BIDsearch over ONLsearch decreases. For such cases, Structator can also employ ONLsearch on the plain sequence data. As a further contribution, we presented for the first time a detailed complexity analysis of bidirectional search using affix arrays. While bidirectional search does not does not improve the worstcase time complexity compared to online search, in practice it runs much faster than online search algorithms and the running time scales sublinearly with the length n of the searched sequences.
Our implementation of the affix array data structure requires only 18n bytes of space. This is a significant space reduction compared to the ~ 45n bytes needed for the affix tree. With the program afconstruct we present for the first time a command line tool for the efficient construction and persistent storage of affix arrays that can also be used as a standalone program for index construction.
With the new concept of RNA secondary structure descriptors (SSDs) combined with fast global and local chaining algorithms, all integrated into Structator, we also introduce a powerful technique to describe RNAs with complex secondary structures. This even allows to effectively describe RNA families containing branching substructures like multiloops, by decomposition into sequences of nonbranching substructures that can be described with RSSPs. Compared to programs like RNAMotif , Structator' s pattern description language for RSSP formulation is simple but powerful, in particular in combination with the SSD concept. Beyond the algorithmic contributions, we provide with the Structator software distribution a robust, welldocumented, and easytouse software package implementing the ideas and algorithms presented in this manuscript.
Availability
The Structator software package including documentation is available in binary format for different operating systems and architectures and as source code under the GNU General Public License Version 3. See http://www.zbh.unihamburg.de/Structator for details.
Declarations
Acknowledgements
This work was supported by the German Research Foundation (grant WI 3628/11). We also thank the anonymous referees, especially referee 2, for their valuable comments and suggestions.
Authors’ Affiliations
References
 Mattick J: RNA regulation: a new genetics? Nat Rev Genet 2004, 5(4):316–323. 10.1038/nrg1321View ArticlePubMedGoogle Scholar
 Mattick J, Taft R, Faulkner G: A global view of genomic information  moving beyond the gene and the master regulator. Trends Genet 2009.Google Scholar
 Gardner P, Daub J, Tate J, Moore B, Osuch I, GriffithsJones S, Finn R, Nawrocki E, Kolbe D, Eddy S, Bateman A: Rfam: Wikipedia, clans and the "decimal" release. Nucl. Acids Res 2010.Google Scholar
 Gardner P, Daub J, Tate J, Nawrocji E, Kolbe D, Lindgreen S, Wilkinson A, Finn R, GriffithJones S, Eddy S, Bateman A: Rfam: updates to the RNA families database. Nucl. Acids Res 2008, 37: D136D140.PubMed CentralView ArticlePubMedGoogle Scholar
 Gardner PP, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs. Nucl. Acids Res 2005, 33(8):2433–9. 10.1093/nar/gki541PubMed CentralView ArticlePubMedGoogle Scholar
 Höchsmann M, Voss B, Giegerich R: Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Biol Bioinform 2004, 1: 53–62. 10.1109/TCBB.2004.11View ArticlePubMedGoogle Scholar
 Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics 2005, 21(16):3352–3359. 10.1093/bioinformatics/bti550View ArticlePubMedGoogle Scholar
 Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problem. SIAM Journal on Applied Mathematics 1985, 45: 810–825. 10.1137/0145048View ArticleGoogle Scholar
 Gorodkin J, Heyer LJ, Stormo GD: Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucl. Acids Res 1997, 25(18):3724–32. 10.1093/nar/25.18.3724PubMed CentralView ArticlePubMedGoogle Scholar
 Havgaard J, Lyngso R, Stormo G, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics 2005, 21: 1815–1824. 10.1093/bioinformatics/bti279View ArticlePubMedGoogle Scholar
 Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology 2002, 317(2):191–203. 10.1006/jmbi.2001.5351View ArticlePubMedGoogle Scholar
 Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genomescale structurebased clustering. PLoS Comput. Biol 2007, 3(4):e65. 10.1371/journal.pcbi.0030065PubMed CentralView ArticlePubMedGoogle Scholar
 Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif  A new RNA secondary structure definition and discovery algorithm. Nucl. Acids Res 2001, 29(22):4724–4735. 10.1093/nar/29.22.4724PubMed CentralView ArticlePubMedGoogle Scholar
 Gautheret D, Major F, Cedergren R: Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comput Appl Biosci 1990, 6(4):325–31.PubMedGoogle Scholar
 RNABOB: a program to search for RNA secondary structure motifs in sequence databases[http://selab.janelia.org/software.html]
 Chang T, Huang H, Chuang T, Shien D, Horng J: RNAMST: efficient and flexible approach for identifying RNA structural homologs. Nucl. Acids Res 2006, 34: W423W428. 10.1093/nar/gkl231PubMed CentralView ArticlePubMedGoogle Scholar
 Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet 1997, 13(12):497–8.View ArticlePubMedGoogle Scholar
 Grillo G, Licciulli F, Liuni S, Sbisà E, Pesole G: PatSearch: A program for the detection of patterns and structural motifs in nucleotide sequences. Nucl. Acids Res 2003, 31(13):3608–12. 10.1093/nar/gkg548PubMed CentralView ArticlePubMedGoogle Scholar
 Nawrocki E, Eddy S: Querydependent banding (QDB) for faster RNA similarity searches. PLoS Comput. Biol 2007., 3(56):Google Scholar
 Nawrocki E, Kolbe D, Eddy S: Infernal 1.0: inference of RNA alignments. BMC Bioinformatics 2009, 25: 1335–1337.View ArticleGoogle Scholar
 Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics 2003, 4: 44. 10.1186/14712105444PubMed CentralView ArticlePubMedGoogle Scholar
 Sakakibara Y: Pair hidden markov models on tree structures. BMC Bioinformatics 2003, 19: i232–40.View ArticleGoogle Scholar
 Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol 2001, 313: 1003–11. 10.1006/jmbi.2001.5102View ArticlePubMedGoogle Scholar
 Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge Univ. Press; 1997.View ArticleGoogle Scholar
 Manber U, Myers E: Suffix arrays: a new method for online string searches. SIAM Journal on Computing 1993, 22(5):935–948. 10.1137/0222058View ArticleGoogle Scholar
 Ferragina P, Manzini G: Indexing compressed text. Journal of the ACM 2005, 52(4):552–581. 10.1145/1082036.1082039View ArticleGoogle Scholar
 Strothmann D: The affix array data structure and its applications to RNA secondary structure analysis. Theor. Comput. Sci 2007, 389(1–2):278–294.View ArticleGoogle Scholar
 Mauri G, Pavesi G: Algorithms for pattern matching and discovery in RNA secondary structure. Theor. Comput. Sci 2005, 335: 29–51. 10.1016/j.tcs.2004.12.015View ArticleGoogle Scholar
 Maaß MG: Linear bidirectional online construction of affix trees. Algorithmica 2003, 37: 43–74. 10.1007/s0045300310292View ArticleGoogle Scholar
 Mauri G, Pavesi G: Pattern discovery in RNA secondary structures using affix trees. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching. Volume 2676. Springer; 2003:278–294. 10.1007/3540448888_21View ArticleGoogle Scholar
 Kärkkäinen J, Sanders P: Simple linear work suffix array construction. In Proceedings of the 13th International Conference on Automata, Languges and Programming. Springer; 2003.Google Scholar
 Puglisi SJ, Smyth W, Turpin A: The performance of linear time suffix sorting algorithms. In DCC '05: Proceedings of the Data Compression Conference. Washington, DC, USA: IEEE Computer Society; 2005:358–367.View ArticleGoogle Scholar
 Manzini G, Ferragina P: Engineering a lightweight suffix array construction algorithm. Algorithmica 2004, 40: 33–50. 10.1007/s0045300410941View ArticleGoogle Scholar
 Abouelhoda M, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2004, 2: 53–86. 10.1016/S15708667(03)000650View ArticleGoogle Scholar
 Fischer J: Wee LCP. Information Processing Letters 2010, 110(8–9):317–320. 10.1016/j.ipl.2010.02.010View ArticleGoogle Scholar
 Kasai T, Lee G, Arimura H, Arikawa S, Park K: Lineartime longestcommonprefix computation in suffix arrays and its applications. Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching 2001, 181–192.View ArticleGoogle Scholar
 Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006, 7: 389. 10.1186/147121057389PubMed CentralView ArticlePubMedGoogle Scholar
 Beckstette M, Homann R, Giegerich R, Kurtz S: Significant speedup of database searches with HMMs by search space reduction with PSSM family models. Bioinformatics 2009, 25(24):3251–3258. 10.1093/bioinformatics/btp593PubMed CentralView ArticlePubMedGoogle Scholar
 Abouelhoda MI, Ohlebusch E, Kurtz S: Optimal exact string matching based on suffix arrays. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval. Volume 2476. Springer; 2002:31–43.View ArticleGoogle Scholar
 de Bruijn N: A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen 1946, 49: 758–764.Google Scholar
 Gardner P, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 2004., 5(140):Google Scholar
 Hofacker I, Fekete M, Stadler P: Secondary structure prediction for aligned RNA sequences. Journal of Molecular Biology 2002, 319(5):1059–66. 10.1016/S00222836(02)00308XView ArticlePubMedGoogle Scholar
 Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic contextfree grammars. Nucl. Acids Res 2003, 31(13):3423–8. 10.1093/nar/gkg614PubMed CentralView ArticlePubMedGoogle Scholar
 Hofacker I: RNA consensus structure prediction with RNAalifold. Methods Mol Biol 2007, 395: 527–544. 10.1007/9781597455145_33View ArticlePubMedGoogle Scholar
 Bremges A, Schirmer S, Giegerich R: Finetuning structural RNA alignments in the twilight zone. BMC Bioinformatics 2010., 11(222):Google Scholar
 Torarinsson E, Havgaard J, Gorodkin J: Multiple structural alignment and clustering of RNA sequences. Bioinformatics 2007, 23: 926–932. 10.1093/bioinformatics/btm049View ArticlePubMedGoogle Scholar
 Harmanci A, Sharma G, Mathews D: Efficient pairwise RNA structure prediction using probabilistic alignment constraints. BMC Bioinformatics 2007., 8(130):Google Scholar
 Reeder J, Giegerich R: Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics 2005, 21(17):3516–23. 10.1093/bioinformatics/bti577View ArticlePubMedGoogle Scholar
 Wilm A, Higgins D, Notredame C: RCoffee: a method for multiple alignment of noncoding RNA. Nucl. Acids Res 2008., 36(9):Google Scholar
 Abouelhoda M, Ohlebusch E: Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 2005, 3(2–4):321–341. 10.1016/j.jda.2004.08.011View ArticleGoogle Scholar
 Cormen T, Leiserson C, Rivest R: Introduction to algorithms. Cambridge, MA: MIT Press; 1990.Google Scholar
 Altuvia S, Zhang A, Argaman L, Tiwari A, Storz G: The Escherichia coli OxyS regulatory RNA represses fhlA translation by blocking ribosome binding. EMBO 1998, 15(20):6069–75.View ArticleGoogle Scholar
 Pollard K, Salama S, Lambert N, Lambot M, Coppens S, Pedersen J, Katzman S, King B, Onodera C, Siepel A, Kern A, Dehay C, Igel H, Ares M, Vanderhaeghen P, Haussler D: An RNA gene expressed during cortical development evolved rapidly in humans. Nature 2006, 443(7108):167–172. 10.1038/nature05113View ArticlePubMedGoogle Scholar
 Schnattinger T, Ohlebusch E, Gog S: Bidirectional search in a string with wavelet trees. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching. Volume 6129. Springer; 2010:40–50. 10.1007/9783642135095_5View ArticleGoogle Scholar
 Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA seondary structure. Bioinformatics 2009, 25(15):1974–1975. 10.1093/bioinformatics/btp250PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.