RNA motif search with data-driven element ordering

Background In this paper, we study the problem of RNA motif search in long genomic sequences. This approach uses a combination of sequence and structure constraints to uncover new distant homologs of known functional RNAs. The problem is NP-hard and is traditionally solved by backtracking algorithms. Results We have designed a new algorithm for RNA motif search and implemented a new motif search tool RNArobo. The tool enhances the RNAbob descriptor language, allowing insertions in helices, which enables better characterization of ribozymes and aptamers. A typical RNA motif consists of multiple elements and the running time of the algorithm is highly dependent on their ordering. By approaching the element ordering problem in a principled way, we demonstrate more than 100-fold speedup of the search for complex motifs compared to previously published tools. Conclusions We have developed a new method for RNA motif search that allows for a significant speedup of the search of complex motifs that include pseudoknots. Such speed improvements are crucial at a time when the rate of DNA sequencing outpaces growth in computing. RNArobo is available at http://compbio.fmph.uniba.sk/rnarobo. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1074-x) contains supplementary material, which is available to authorized users.


an optional search order
Each structural element is either single stranded (denoted by s) or helical (denoted by h or r). Detailed specification of each element consists of the following parts (the fields in bold are mandatory, while the fields in italic are optional ): (1.) the number of mismatches allowed (in helical elements, mismatches are allowed only on the positive strand), (1b.) the number of mispairs allowed (for helical elements only), (2.) the number of single nucleotide insertions allowed, (3.) primary sequence constraints: a string composed of IUPAC nucleotide codes and wild cards "*". A wild card matches one nucleotide or none. Alternatively, an abbreviation for e.g. 10 wild cards can be written as "[10]", (3b.) primary sequence constraints for the negative strand of a helical element. In helical elements, wild cards can occur only in pairs, i.e. for every wild card there must be a corresponding wild card on the other strand at the exactly opposite position, (4.) IUPAC nucleotide code for allowed insertions, (5.) a transformation string specifying pairings allowed in a relational element of type r (see details below).
Formatting of these fields is illustrated on the following simple motif composed of two elements: a helix h1 capped by a single strand s1.

ACCRNNT
Unlike RNAbob, RNArobo allows nucleotide insertions in individual elements. Syntax for specifying insertions is similar to the specification of the maximum number of mismatches or mispairs. The user has to specify the maximum number of insertions in the structural element and the identity of inserted nucleotides in the IUPAC code. Insertions are not allowed to occur at the very beginning and end of the matched regions, and insertions in helical regions cannot be adjacent nor opposite.
The following example demonstrates the use of insertions in a descriptor: h1 s1 h1' h1 0:0:2 NNN**CC:GG**NNN:A s1 0:1 ACCRNNT:Y In the h1 helix we allow up to 2 insertions of adenosine, while in the single stranded element s1 only one insertion of a pyrimidine nucleotide is allowed ('Y' stands for cytosine or thymine/uracil).
To specify a custom pairing function for a helical element, a relational element of type r can be used instead of a standard helix of type h, as in this variant of the previous descriptor: r1 s1 r1' r1 0:0:2 NNN**CC:GG**NNN:A TGCA s1 0:1 ACCRNNT:Y The relational element r1 allows only canonical base-pairs A-T and C-G. The individual IUPAC codes in the transformation string TGCA define nucleotides that can pair with A, C, G, and T, respectively, in this order. For default helical elements of type h, RNArobo allows also G-U wobble pair, as the default transformation string is TGYR.
The last line of a descriptor can contain an optional reorder command, which specifies the order in which elements are internally searched by the RNArobo algorithm, similarly to RNAMot Gautheret et al. (1990). If this command is absent or does not contain all elements, the automatic data-driven method is used to determine the best possible ordering of all remaining elements. This command has no principal impact on the actual results of the search, but defining a previously trained order can speed up the search by few seconds. Here is the previous descriptor with the element ordering line added: r1 s1 r1' r1 0:0:2 NNN**CC:GG**NNN:A TGCA s1 0:1 ACCRNNT:Y R s1 r1

S2 Dynamic Programming Recurrences
In this section, we describe the dynamic programming recurrences for finding all matches of a single element from the descriptor in a sequence window. We start by describing the algorithm for single-strand elements. Let us consider finding matches of a single-stranded pattern P in a text T with at most M mismatches and I insertions of a single-letter pattern P I .
We use four dimensions of a five-dimensional table S to keep track of position in T , position in P , the number of occurred mismatches, and the number of insertions, respectively. The fifth dimension is binary, and is intended to serve as a flag, whether the previous aligned symbol of T is an insertion, as one insertion cannot follow another.
Formally, we define a binary function S t,p,m,i,b as follows: We start by computing values of S for the empty prefix of the pattern, using the initial condition ∀t ∈ {0 . . . |T |} S t,0,0,0,1 = 1.
The remaining values are computed using the following recurrence: A match of the pattern P is found in the text T ending at position t ≤ |T | with m ≤ M mismatches and i ≤ I insertions if S t,|P |,m,i,0 = 1. Now we turn our attention to the more complex case of paired elements. The problem is to find all occurrences of a paired pattern P : P where P , P are patterns for individual strands (where |P | = |P |) in a text T . Furthermore, we allow for imperfect matches with up to M mismatches, R mispairings, and with at most I insertions of a single-letter pattern P I together in both strands.
To address this pattern matching problem, we introduce a function H, and a recurrence formula for its computation. The binary function H t1,t2,p,m,r,i,b for paired (helical) elements is the following: Again, we start with initial conditions for the empty prefix of the pattern: The recurrence for the remaining values works as follows: A match of the pattern P : P is found in the text T , P ending at position t 1 ≤ |T |, P beginning at position t 2 ≤ |T | with m ≤ M mismatches, r ≤ R mispairs, and i ≤ I insertions if H t1,t2,|P |,m,r,i,0 = 1.

S3 Information content heuristic
In this section, we describe the details of the information content heuristic function h 1 omitted from the main text (Section "Element Ordering"). This function is an approximation of the information content of an element, favoring elements that pose more specific constraints.
For a single-stranded element S, let N be the length of its longest possible match and let X be the number of sequences of length N that contain a match of the element starting at the first position. As explained in the main text, the information content of element S is then estimated as h 1 (S) = 2N − log 2 X.
Since the value of X is hard to compute for complex elements, we instead use an upper bound X U ≥ X. To obtain the upper bound, we count different ways of obtaining a sequence matching S, disregarding the fact that some sequences may be obtained in several different ways and consequently counted multiple times.
In the simplest case, element S does not contain any flexible-length wild cards and does not allow for any distortions (mismatches, insertions). The element specifies for each position i the set of allowed nucleotides; let C[i] be the size of this set. The value of X is then simply (1) Next we extend the bound to cases when S contains wild cards. Each wild card corresponds to an arbitrary nucleotide or to an empty string. A block of k consecutive wild cards thus corresponds to an arbitrary sequence of length up to k. Let X 1 be the value obtained by formula (1) if we consider only nonwild card positions in S. A single block of k consecutive wild cards increases the value of N (the length of the longest occurrence of S) by k. These k additional nucleotides can be arbitrary, and are split into a block of length i matching the block of wild cards and a block of length k − i located after the element occurrence (this block corresponds to the unused wild cards). Since the value of i can be any integer between 0 and k, this leads to the upper bound of X 1 (k + 1)4 k sequences matching S. If S has multiple blocks of wild cards of lengths k 1 , . . . , k b , each of them can be split into two blocks independently, leading to the upper bound (2) Next, let us assume that element S allows up to M M mismatches. Let N be the number of positions where a mismatch can occur (in this count we omit wild cards as well as positions where S allows any nucleotide). Let us denote by A the set of positions where mismatches actually occur (the size of A is at most M M ). There are i∈A C[i] ways to place matching nucleotides at positions in A and i∈A (4 − C[i]) ways to place mismatches at those positions. We could obtain our estimate by enumerating all sets A of size at most M M : Instead, we could use an upper bound by assuming C[i] to be 1 at every position, allowing for a simpler formula.
In the real application, we instead replace C[i] by the empirical meanC of all the positions where a mismatch can occur, obtaining an expression which is not guaranteed to be an upper bound of X, but works well in practice:C Finally, we address the case where S allows up to M I insertions. The nucleotides to be inserted are constrained in the descriptor to come from some set of size C ins . Let N be the number of positions in S, where an insertion can occur. If the actual number of insertions is j, they can be inserted in N j C j ins ways. Similarly to unused wild cards, we have to add padding for each unused insertion to achieve sequences of total length N . Thus we obtain the following approximate upper bound X U for count X: Computation for paired element is analogous. Let H be an element consisting of two paired strands H 1 and H 2 , and let N be the maximum length of a match to one of these two strands, after accounting for wild cards and insertions. Let X be the number of pairs of sequences of length N such that H 1 occurs in the first sequence starting at the first position, and H 2 occurs in the second sequence ending at the last position, and these two occurrences satisfy the complementarity constraints with up to allowed amount of distortion. We again compute an approximate upper bound X U instead of the number X, counting different ways that such a matching can occur. Then we use the following estimate of the information content of a paired element H: As with single-stranded elements, we first count the number of sequences that match H without considering wild cards and distortions. Let P [i] be the number of valid base pairs between position i of H 1 and the corresponding position of H 2 . The value of P [i] is determined by both complementarity constraints specified by H and by sequence constraints for the respective positions in H 1 and H 2 . As before, the total number of matching sequences is the product Wild cards occur in symmetrical positions of H 1 and H 2 , and each block of k wild cards has to be matched by j ≤ k valid base pairs. Therefore, the estimate for b blocks of wild cards with lengths k 1 , . . . , k b is In this expression, P is the number of admissible base pairs, for example 4, if we require strict Watson-Crick pairs, or 6, if we allow U-G pairs as well. The factor 16 ki−j comes from the need to pad each of the two sequences with k i − j nucleotides. Mismatches are allowed only on the positive strand H 1 of H. Let P [i] be the same as P [i] except that we disregard sequence constraints specified for position i in H 1 . Therefore P [i] − P [i] is the number of base pairs that form a potential mismatch at position i (satisfying the complementarity constraints). As before, we could enumerate all possible sets A of mismatch positions of size up to M M : To simplify the formula, we could again use an upper bound on the term (P , but in practice we approximate the actual value of P [i] by the empirical meanP and we approximate P [i] by the total number of base pairs P . Thus we obtain the following estimate: As before, N is the number of positions in H 1 where a mismatch can occur. Paired elements may also allow up to M P mispairs, where the two paired nucleotides do not form a valid base pair. Mispairs are treated similarly as mismatches; value P [i] is replaced by 16, which is the number of all possible pairs. Value P [i] is again approximated by its meanP . Unlike mismatches, mispairs can occur at all positions of the motif, including portions matching wild cards. The number of such positions is N = N − M I , where M I is the number of allowed insertions. We obtain the following bound Finally, if element H allows up to M I single nucleotide insertions, sequence of each strand has space for M I insertions, and therefore any of these 2M I positions not used as insertions need to be padded by arbitrary bases. For simplicity, we ignore the restriction that positions of two insertions should not be opposite in a helix. This gives us the final value X U used in computing the information content heuristic.

S4 DDEO Performance
Here, we demonstrate the performance of the DDEO heuristic on the hepatitis delta virus like ribozyme (HDV) motif ( Fig.4), which is the most complex motif that we used in our experiments. The motif contains four helical paired elements and six single stranded elements, organized in a double pseudoknot. Figures S1 and S2 show details of DDEO operation with the HDV ribozyme descriptor over five runs of RNArobo on the human genome. Both figures show data for all element orderings selected for the initial candidate set O. Fig.S1 shows that the heuristic score is not perfect; although many good orderings were included in the set, they are mixed with much worse orderings. Fig.S2 shows the number of samples necessary to eliminate a particular ordering. For orderings with a bad performance, we typically need very few samples (as few as two in some cases), while the orders with performance close to the optimum are tried many times, but since their performance is good, their repeated use does not increase the overhead significantly.

S5 Extended Description of Experiments
RNArobo uses the descriptor format of RNAbob (Eddy, 1996), but extends it with the option of allowing insertions, which is particularly useful for representing bulges in helical elements. Here we demonstrate the utility of our tool on several data sets containing functional ribozyme occurrences reported in literature (Webb et al., 2009;Webb and Lupták, 2011;Perreault et al., 2011). For every RNA motif, the same descriptor was used for search with both RNArobo and RNAbob. RNArobo was also run with a variant of each descriptor allowing insertions in helical regions. In several cases, these modified descriptors allowed us to discover additional known or putative ribozyme occurrences. The results are summarized in Table 1 and described in detail below.
The results of RNA motif searches often contain false positives, whose sequence satisfies the restrictions given in the descriptor but would not form the desired secondary structure if transcribed. To filter out such false positives, we have used our post-processing pipeline named Fold-Filter (Jimenez et al., 2012), which is provided as a part of the RNArobo 2.1 package. This pipeline predicts secondary structure and its stability using ViennaRNA package (Lorenz et al., 2011) and DotKnot (Sperschneider and Datta, 2010) and compares the results with user-defined thresholds. The last three experiments illustrate the use of this pipeline for improving specificity of the search.   Figure S1: The average time T x used by the first 40 triplets with the best heuristic score (ordered by the heuristic score from left to right) in five runs of RNArobo search in the human genome for the HDV ribozyme descriptor. The final element orderings selected in each run are highlighed.

S5.1 GTP aptamer class I
In this experiment, we search a compilation of sequences acquired through in vitro selections for GTP aptamers by Szostak and coworkers (Davis and Szostak, 2002). We have constructed a descriptor for the GTP aptamer class I encompassing both the conserved sequence and secondary structure facilitating ligandaptamer binding (Carothers et al., 2006). The descriptor for the search permitting insertions limits them to adenosine only. Both RNAbob and RNArobo discover nine unique sequences fitting the descriptor, but after allowing insertions, RNArobo finds one new instance. Since the library was selected for GTP binding, we assume that all these instances are functional.

S5.2 HHR type I in Yarrowia lipolytica.
The hammerhead ribozyme (HHR) motif is characterized by three helices anchored in a sequence conservedcatalytic core (Perreault et al., 2011). These structure descriptors (hhr1-4bp and hhr1-4bp-ins) were used to search through the yeast Yarrowia lipolytica CLIB122 (NC 006067.1-NC 006072.1) genome. This genome contains several type I HHRs (Perreault et al., 2011). Both RNAbob and RNArobo found a single HHR (Yli-1-3 position 1037945-1037850) using a descriptor with a strict requirement for at least four base pairs in each helix (hhr1-4bp). An RNArobo search using the same descriptor, but allowing single-nucleotide insertions in each of the three helices (hhr1-4bp-ins), increased the number of unique hits from one to fifteen. Among these hits were Yli-1-3, and also Yli-1-4 through Yli-1-11 ribozymes. The search allowing insertions was necessary for finding the eight additional ribozymes.
Decreasing the requirement in base pairing of helices from four to three base-pairs resulted in a total of four hits from both RNAbob and RNArobo searches (hhr1-3bp), including Yli-1-3 and Yli1-13, which was not found when helices were required to be at least four base pairs long. Allowing for single nucleotide insertions in any of the three helices of the HHR (hhr1-3bp-ins) significantly increased the number of hits returned by RNArobo from four to fifty-four, including the same known ribozymes (Yli-1-3 through Yli-1-11 and Yli-1-13). A sequence alignment of the output reveals that many sequences can be grouped into two main families. The first family includes the known "Yli" ribozymes, and the second includes previously unidentified putative ribozymes. Further analysis of this second family was done using the Fold-Filter script included with the RNArobo-2.0 package.  Each of the ten sequences in the second family, although not necessarily unique in sequence, is unique in the genomic location, making these ten putative ribozymes independent findings ( Figures S7 and S8). In this instance, RNArobo may have found a previously unidentified family of type I HHRs in the Y. lipolytica genome. This is significant in that no other sequence-conserved family of HHR is known to exist in this genome, and the new family is similar to the large family of HHRs in the Schistosoma mansoni genome.   Figure S7: A new previously unidentified putative family of type I HHRs in the Y. lipolytica genome. Each of the ten sequences in the family, although not necessarily unique in sequence, is unique in the genomic location, making these ten putative ribozymes independent findings.
C C A C 10 nts Figure S8: Putative secondary structure of the novel HHRs from Y. lipolytica.

S5.3 HHR type II in Bacillus cereus genome
Type II HHRs differ from type I only by the stem containing the 5' and 3' termini of the RNA. These descriptors (hhr2-3bp and hhr2-3bp-ins) were used to search for type II HHRs in the bacterium Bacillus cereus ATCC 14579 genome (NC 004722.1) (Ivanova et al., 2003), which has one known example of this ribozyme. Both RNAbob and RNArobo yielded a single sequence corresponding to the known HHR Bce-1-1. The RNArobo search allowing insertions (hhr2-3bp-ins) increased the number of hits from one to four. The Fold-Filter script, based on the Vienna RNAfold algorithm did not predict optimal secondary structure formation for any of the three additional hits, suggesting that these may not be bona fide ribozymes.
S5.4 HDV-like ribozyme in Anopheles gambiae chr2L sequence HDV-like ribozymes are characterized by a nested double-pseudoknot secondary structure with only six conserved nucleotides. The descriptors hdv-looseP4 and hdv-looseP4-ins were used to search chromosome 2L (NT 078265.2) of the mosquito Anopheles gambiae (Holt et al., 2002). The genome of this mosquito is known to harbor multiple families of HDV-like ribozymes, many of which map to retrotransposable elements (Webb et al., 2009;Ruminski et al., 2011). These ribozymes were first discovered by an RNAbob search followed by sequence similarity search, which found additional examples not fitting the descriptor (Webb et al., 2009). Both RNAbob and RNArobo found known drz-Agam-1-1 ribozyme and an additional putative ribozyme which displays characteristic HDV-like secondary structure. No other output from these searches appeared to have features characteristic of canonical HDV-like ribozymes. The output from RNArobo allowing for insertions produced one additional known ribozyme, drz-Agam-1-2, which was previously identified through sequence similarity to Agam-1-1.

S5.5 HDV-like ribozymes in Strongylocentrotus purpuratus genome
The genome of the purple sea urchin (Sodergren et al., 2006) is known to contain many HDV-like ribozymes, which can be grouped into at least four sequence families (Webb et al., 2009). Two different types of searches were conducted in this genome. The first was a loose search allowing region P4 to be variable (element s5 in hdv-looseP4), resulting in numerous hits, many of which were further determined by Fold-Filter to be false positives. Of the fifteen confirmed ribozymes found, most group into two sequence families with a single outlier possibly belonging to a separate family. The RNArobo search allowing insertions (hdv-looseP4-ins) found one additional positive hit, which aligns with one of the two families seen previously (see sequence alignments in Figure S15 and S16 and structure in Figure S17).
A second, more restrictive search in the S. purpuratus genome included the P4 region in the defined secondary structure (element h4 and s5 in hdv-stemP4), the helical region being an essential feature of the s 1 h1 s 2 h2 s 3 h2 ' s 4 h3 s 5 h3 ' s 6 h1 ' s 7 h1 0 : 0 * * * * * * NNN:NNN * * * * * * h2 0 : 0 * * * * * * NNN:NNN * * * * * * This search found one additional positive hit, which aligns with the latter of the two families seen in the previous search (Fig.S15). Figure S17: Putative secondary structure of the novel HDV from S. purpuratus. Table 2 in the main text contains running time comparison of RNArobo with several other tools. RNAbob can be directly compared, as RNArobo uses an extended version of the same descriptor format. We have created descriptors for RNAmotif and RNAmot that match RNArobo descriptors as closely as possible, but there are some small differences. For example, RNAMot does allow mispairs in helical regions, and the number of allowed wobble pairs and base mismatches is parametrized globally in the whole motif, rather than per-element, as in RNArobo. In this section, we list all descriptors used in our experiments.

S6.1 RNArobo descriptors
Most descriptors used in Table 2 for RNArobo and RNAbob are shown elsewhere in this supplement. In particular, GTP aptamer is in Figure S3, tRNA in Figure S42, HHR type I in Figure S5, HHR type II in Figure S9, HDV-like ribozyme with loose P4 region in Figure S11, and HDV-like ribozyme with structured P4 region in Figure S13. The remaining descriptors are shown below.

S7 Comparison with RaligNAtor
In this section, we provide additional details of the experimental comparison of RNArobo running times with RaligNAtor, a recent tool by Meyer et al. (2013). RaligNAtor provides two main search methods, lscan for online search, and lgslink that uses an index built in advance for the sequence database.
RaligNAtor can search for structural patterns, specified as a set of non-overlapping substructures that do not share any elements (for example a hairpin cannot be split to substructures). In each such substructure, users can allow indels, replacements, or mis-pairs, each with an individual penalty cost. However pseudoknotted structures are not possible.
We compared the running time on two structural motifs: Cripavirus internal ribosome entry site (IRES) contained in the RaligNAtor package, and the generalized tRNA motif from our experiments. Additionally, we conducted search for the IRES motif with no distortions allowed. We searched the motifs in the sequences from the Rfam 11 database, and in the whole genome of Drosophila melanogaster.
The structural pattern definition language used by RaligNAtor (Meyer et al., 2013) is not compatible with RNArobo, as RaligNAtor enables to set distortion parameters only for whole substructures, while RNArobo uses element-wise parametrization. Additionally, RNArobo does not support user-defined penalty cost schemes. To compare the running time, we approximately translated the motif description from Ralig-NAtor format to RNArobo format in case of IRES, and the other way around for tRNA (see descriptors in Figures S39, S40, S41, and S42). For both motifs, the RNArobo descriptor yields more unique hits in the searched sequences (see Table S1). Since matches generally slow down RNArobo search, the differences in the descriptors should not favor RNArobo.