 Proceedings
 Open Access
Efficient known ncRNA search including pseudoknots
 Cheng Yuan^{1} and
 Yanni Sun^{1}
https://doi.org/10.1186/1471-2105-14-S2-S25
© Yuan and Sun; licensee BioMed Central Ltd. 2013
 Published: 21 January 2013
Abstract
Background
Searching for members of characterized ncRNA families containing pseudoknots is an important component of genome-scale ncRNA annotation. However, state-of-the-art known ncRNA search is based on context-free grammars (CFGs), which cannot effectively model pseudoknots. Thus, existing CFG-based ncRNA identification tools usually ignore pseudoknots during search. As a result, these tools report dozens of sequences that do not contain the native pseudoknots. When pseudoknot structures are vital to the functions of the ncRNAs, these sequences may not be true members.
Results
In this work, we design a pseudoknot search tool using multiple simple substructures, which are derived from knot-free and bifurcation-free structural motifs in the underlying family. We test our tool on a contiguous 22-Mb region of the Maize Genome. The experimental results show that our work competes favorably with other pseudoknot search methods.
Conclusions
Our substructure-based tool can conduct genome-scale pseudoknot-containing ncRNA search effectively and efficiently. It provides a pseudoknot search tool that is complementary to Infernal. The source code is available at http://www.cse.msu.edu/~chengy/knotsearch.
Keywords
 False Positive Rate
 Maize Genome
 Seed Sequence
 Pseudoknot Structure
 Parsing Algorithm
Background
It is well established that pseudoknots play important roles in telomerase RNA, tmRNA, rRNA, some riboswitches, some protein-binding RNAs, viral ribosomal frameshifting signals, etc. [2]. Different research groups [3, 4] have shown that the pseudoknot structure in telomerase RNA is essential for telomerase activity. Gilley and Blackburn [3] experimentally demonstrated that disruptions of the pseudoknot base pairing within the telomerase RNA from Tetrahymena thermophila prevent the stable assembly in vivo of an active telomerase. They further concluded that the pseudoknot topology rather than the sequence is critical for an active telomerase. Similarly, biologists reported that the pseudoknots in tmRNA are highly important for protein binding, tmRNA maturation, and proper folding of the tRNA-like domain [5]. Currently, 26,704 sequences in 71 ncRNA seed families of Rfam 10.0 [6] contain pseudoknots. With advances in sequencing technologies and structure prediction, more pseudoknot structures are expected to be revealed.
Because the functions of ncRNAs are determined by both sequence and structure, successful ncRNA homology search tools must consider both sequence and structural conservation. Existing ncRNA search tools can be divided into two categories. The first is commonly referred to as "known ncRNA search", which aims to detect homologs of ncRNAs with annotated secondary structures. The second category includes tools for identifying novel ncRNA genes. This work belongs to the first category and focuses on ncRNAs containing pseudoknots.
For pseudoknot-free ncRNAs, the state-of-the-art search method is based on stochastic context-free grammars (SCFGs), which can accurately model the evolutionary changes of both the sequences and structures of a group of homologous ncRNAs. Commonly used general and specialized known ncRNA search tools such as Infernal [7], RSEARCH [8], and tRNAscan-SE [9] are all based on SCFGs. In conjunction with the ncRNA family database Rfam, Infernal has been successfully applied to classify query sequences into different types of ncRNAs. However, SCFGs are not able to model pseudoknots. Thus, the implementations of SCFGs by Infernal neglect pseudoknots in the structures. For example, although RF00023 (tmRNA) has four pseudoknots, its SCFG only models the knot-free structures, shown in Figure 1.B. As a result, Infernal could misclassify sequences as members of families containing pseudoknots. In addition, Infernal has a high computational cost, limiting its usage on large-scale data sets, such as those generated by next-generation sequencing technologies.
More complicated grammars such as context-sensitive grammars (CSGs) [10] exist to faithfully model pseudoknots. However, the computational cost of the parsing algorithms for a CSG is even higher than for a CFG. Besides CSGs, other grammars such as parallel communicating grammar systems [11], RNA pseudoknot grammars [12], tree adjoining grammars (TAGs) [13, 14], and multiple context-free grammars [15] have been proposed to model pseudoknot structures. These works described the grammars and associated parsing algorithms. However, they have not been widely used for pseudoknot search in large-scale databases. First, although the parsing algorithms are polynomial, their cubic or even higher time or memory complexity [15] limits their large-scale application. Second, these methods were designed for and tested on secondary structure derivation rather than homology search. In order to conduct large-scale homology search, local parsing algorithms are needed. As no source code or executable implementations of these grammars are available, it is not clear whether they can be automatically applied to known ncRNA search including pseudoknots.
In this work, we design an efficient pseudoknot search algorithm for all types of pseudoknots. Our method is based on a set of carefully chosen simple substructures (or substructures for short), which do not contain pseudoknots or bifurcations. The time complexity of the parsing and probability computation algorithms for an SCFG, including the CYK, inside, and outside algorithms, is significantly reduced when the secondary structure does not contain any bifurcation [10, 16]. Thus, these simple substructures can be searched efficiently using existing implementations of SCFGs. From the multiple substructures extracted from one ncRNA family, we choose a set of substructures according to their sizes and false positive (FP) rates in order to maximize the search performance. These chosen substructures are used in a progressive search. Our experimental results show that our tool competes favorably with other pseudoknot search methods.
Related work
Brown and Wilson [17] proposed an RNA pseudoknot search method using intersections of SCFGs. Both Brown and Wilson's method and our approach try to decompose a pseudoknot into knot-free structures for SCFG modeling. There are two major differences. First, our substructures are not only knot-free but also bifurcation-free, which enables faster search. Second, while Brown and Wilson's method focused on the model construction and parsing algorithm, we focus on choosing an optimal set of substructures to optimize the search performance. The model construction and the parsing algorithms can be conveniently implemented using Infernal, which has gone through extensive testing.
Structural motifs similar to substructures have been used as filters to speed up Infernal. FastR [18] relies on stem-loops ((k, w)-stacks) that do not contain bulges or interior loops to search for ncRNAs. Weinberg et al. [19] use more flexible structural motifs based on sub-CMs and profile HMMs for ncRNA classification. Smith [16] used a decision tree to organize partial SCFG models for fast ncRNA search. Currently, these filters are only designed and tested for speeding up SCFG search.
Available pseudoknot search tools include RNAv [20] and RNATOPS [21]. RNATOPS designs a graph model for RNA pseudoknots and solves the structure-sequence alignment by graph optimization. RNAv is a profile-based RNA secondary structure variation search program that detects distant ncRNA structural homologs, which might be missed by RNATOPS.
The chain filter designed by Zhang et al. [22] consists of a collection of short conserved words in an ncRNA family. In our work, we use a collection of simple substructures for pseudoknot search. Similar to Zhang et al.'s work, we find that using a collection of simple structures can achieve a good tradeoff between sensitivity and false positive rate during search.
Approach
There are two components in the method. The first component is the design of a set of substructures to represent an ncRNA family. The second component is a progressive search strategy using the designed substructures. Different regions of an ncRNA sequence have different sequence and structural conservation. Well-conserved structural and sequence motifs tend to yield better search performance than poorly conserved motifs. Our approach sorts substructures extracted from different regions according to their lengths and predicted FP rates in order to choose a set of substructures with the optimal search performance.
Substructure derivation
We describe a method to systematically extract all simple substructures from a pseudoknot. In the first step, all stems are extracted and sorted in increasing order of their starting positions (i.e., the 5' end of the outermost base pair in the stem). Second, we build a bit table R of size N by N for the N stems extracted in the first step. For each cell R[i, j], if stem i and stem j are nested, R[i, j] = 1; otherwise, R[i, j] = 0. Table R tells us whether given stems can form one substructure. Given the stem set and the relationship table R, we use the pseudocode in Algorithm 1 to extract all simple substructures. In the pseudocode, H^{x} is the set of substructures containing x stems. Thus, the union of H^{x} for x = 1 to N consists of all simple substructures for a given secondary structure. The number of substructures depends on the number of nested stems. Suppose the average number of nested stems inside a stem is n. The total number of substructures is O(N + N 2^{n}).
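The construction of table R can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes each stem is given as a hypothetical tuple (five_start, five_end, three_start, three_end) of sequence positions, and it treats two stems as nested when one lies entirely between the two strands of the other. The names `Stem`, `encloses`, and `build_nesting_table` are introduced here for illustration only.

```python
from typing import List, Tuple

# Hypothetical stem representation (an assumption): the 5' strand spans
# five_start..five_end and pairs with the 3' strand three_start..three_end.
Stem = Tuple[int, int, int, int]


def encloses(outer: Stem, inner: Stem) -> bool:
    """True if `inner` lies entirely between the two strands of `outer`."""
    return outer[1] < inner[0] and inner[3] < outer[2]


def build_nesting_table(stems: List[Stem]) -> Tuple[List[Stem], List[List[int]]]:
    """Sort stems by 5' start and return them with the N-by-N bit table R."""
    ordered = sorted(stems, key=lambda s: s[0])
    n = len(ordered)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if encloses(ordered[i], ordered[j]) or encloses(ordered[j], ordered[i]):
                table[i][j] = table[j][i] = 1
    return ordered, table


# Toy example: after sorting, stems 0 and 1 are nested, and stem 2 crosses
# both of them (a pseudoknot), so it is nested with neither.
example_stems = [(1, 5, 20, 24), (10, 14, 30, 34), (7, 8, 16, 17)]
ordered_stems, R = build_nesting_table(example_stems)
```

The diagonal is left at 0 in this sketch because a stem is never combined with itself.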
Algorithm 1 ExtractSubstructures
Input: a secondary structure containing pseudoknots
Output: all simple substructures
for each stem i = 1 to N do
    /* h: a substructure containing a set of stems */
    h = {i}
    H^{1} = H^{1} ∪ {h}
end for
for L = 2 to N do
    H^{L} = Ø
    for each substructure h ∈ H^{L-1} do
        for each stem i ∉ h do
            /* h[i] is the ith stem in a substructure h */
            if R[h[1], i] and R[h[2], i] ... and R[h[L-1], i] then
                /* construct a new substructure h' */
                h' = h ∪ {i}
                H^{L} = H^{L} ∪ {h'}
            end if
        end for
    end for
end for
output all substructures H = H^{1} ∪ H^{2} ∪ ... ∪ H^{N}
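A runnable counterpart of Algorithm 1 can be sketched in Python as follows. It is an illustrative sketch under stated assumptions: stems are indexed from 0, R is a symmetric 0/1 matrix as described above, and each substructure is stored as a frozenset so that the same stem set reached in different orders is kept only once. The function name `extract_substructures` is hypothetical.

```python
from typing import FrozenSet, List


def extract_substructures(R: List[List[int]]) -> List[FrozenSet[int]]:
    """Enumerate every stem set whose members are pairwise nested."""
    n = len(R)
    # H^1: each stem alone is a simple substructure.
    levels = [{frozenset([i]) for i in range(n)}]
    for _size in range(2, n + 1):
        current = set()
        for h in levels[-1]:
            for i in range(n):
                # Stem i may join h only if it is nested with every stem in h.
                if i not in h and all(R[k][i] for k in h):
                    current.add(h | {i})
        if not current:
            break  # no larger substructure can exist
        levels.append(current)
    # Union of all levels, ordered by size and then by stem indices.
    return [h for level in levels for h in sorted(level, key=sorted)]


# Three stems: 0 and 1 are nested, stem 2 crosses both.
R_example = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
substructures = extract_substructures(R_example)
```

For this toy table the sketch yields the three single stems plus the nested pair {0, 1}; the crossing stem 2 never joins another stem.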
Algorithm 1 only outputs the combination of stems. For each stem (or stem set) in a substructure, we add loop and flanking regions using the following three rules. Let the 5' and 3' ends of the outmost base pair in a substructure be I_{5} and I_{3}, respectively. Thus, I_{5} <I_{3}.

Rule 1: Add all single-stranded regions, including bulges and internal loops, between I_{5} and I_{3}.

Rule 2: Except for the base pairs inside the chosen stems of a substructure, all other base pairs are treated as single-stranded regions.

Rule 3: Extend the flanking single-stranded regions to the left of I_{5} and to the right of I_{3} until the first base pair in other substructures.
Search performance of different substructures
Sort substructures according to their E-values
There are two methods to calculate the FP rates of substructures. Theoretically, by assuming a background model for random sequences and applying the CYK algorithm [10], we can directly calculate the probability that a random sequence matches an SCFG model. Empirically, we can apply the SCFGs to a large annotated sequence database and record the number of FP matches. However, as it is more important to compare the FP rates of different substructures than to know their exact values, it is not necessary to directly calculate FP rates. By assuming that the SCFG alignment scores for random sequences follow an exponential distribution, as implemented by Infernal, we can use the E-values of the designed score cutoffs to sort all substructures.
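One form of the E-value computation that is consistent with an exponential score tail and with the symbols defined in the following paragraph is $E(sc) = e^{-\lambda (sc - \mu)} \cdot \frac{db_{target}}{db_{random}} \cdot randhit$. This form is an assumption made for illustration, not a quotation of the original equation, and the function and argument names in the Python sketch below are hypothetical.

```python
import math


def evalue(sc: float, mu: float, lam: float,
           db_target: float, db_random: float, randhit: float) -> float:
    """Assumed exponential-tail E-value for a bit score `sc`."""
    # Tail probability of the fitted exponential (meaningful for sc >= mu).
    tail = math.exp(-lam * (sc - mu))
    # Scale by the target/random database ratio and the random-hit count.
    return tail * (db_target / db_random) * randhit


# At sc == mu the tail term is 1, so only the scaling factors remain.
at_mu = evalue(sc=10.0, mu=10.0, lam=0.5, db_target=2e6, db_random=1e6, randhit=3)
# A higher score cutoff gives a smaller E-value under this form.
higher = evalue(sc=20.0, mu=10.0, lam=0.5, db_target=2e6, db_random=1e6, randhit=3)
```

Because only the relative order of E-values is used to sort substructures, any monotone rescaling of this quantity would give the same ranking.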
In the E-value computation, μ and λ are parameters trained in Infernal, sc is the score for which the E-value is calculated, db_{target} is the size of the target database, db_{random} is the number of random sequences generated for curve fitting, and randhit is the number of random sequences found by the covariance model. We can directly obtain μ and λ from each calibrated covariance model, which is built for a substructure. With these two parameters available, we can use the above equation to compute E-values for given scores.
The order of E-values is highly consistent with the order of the number of FP hits.

| Substructure | E-value | FP hits |
|---|---|---|
| RF00373_part2 | 1.71e+03 | 4894 |
| RF00373_part1 | 7.33e+02 | 2606 |
| RF00373_part5 | 7.30e-02 | 41 |
| RF00373_part3 | 3.58e-02 | 39 |
| RF00373_part4 | 3.18e-06 | 39 |
| RF00373_part6 | 5.29e-09 | 37 |
| RF00373_part8 | 4.52e-09 | 39 |
| RF00373_part7 | 3.40e-15 | 34 |
Choose substructures for progressive search
During a progressive search based on multiple substructures, the final sensitivity is bounded by the lowest sensitivity of all substructures. The final search time and FP rate depend heavily on the order in which these substructures are applied. Let the final array of substructures be $\mathcal{S}\mathcal{U}\mathcal{B}=\left({\mathcal{H}}_{1},...,{\mathcal{H}}_{i},...,{\mathcal{H}}_{n}\right)$, where ${\mathcal{H}}_{i}$ will be applied before ${\mathcal{H}}_{j}$ if i < j. Let the size of the original database be L. For a substructure ${\mathcal{H}}_{i}$, let t_{i} and fp_{i} be its search time per hit and FP rate, respectively. The final FP rate is bounded by $\prod_{i=1}^{n} fp_{i}$. The final search time is roughly $T = L\sum_{i=1}^{n} t_{i} \left(\prod_{j=1}^{i-1} fp_{j}\right)$, where $L\prod_{j=1}^{i-1} fp_{j}$ is roughly the search space for the substructure ${\mathcal{H}}_{i}$. Minimizing T requires the accurate computation of t_{i} or quantification of the relationship between t_{i} and fp_{i}, which is not known a priori. Although Infernal provides an estimated running time, it can be quite different from the true running time. According to the equations, it is clear that we should apply short substructures with small FP rates before long substructures with high FP rates. Thus, we develop a greedy algorithm to generate a set of substructures for progressive search based on our empirical observations.
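The effect of ordering on the approximate search time T can be checked numerically. The following Python sketch simply evaluates the sum-of-products expression above for a given order; the function names and the toy numbers are illustrative assumptions, not measurements from the paper.

```python
from typing import Sequence


def progressive_search_time(db_size: float,
                            times: Sequence[float],
                            fp_rates: Sequence[float]) -> float:
    """Approximate T = L * sum_i t_i * prod_{j<i} fp_j for one ordering."""
    total = 0.0
    remaining = 1.0  # fraction of the database still searched at step i
    for t, fp in zip(times, fp_rates):
        total += t * remaining
        remaining *= fp  # only FP-rate-sized fraction survives to the next step
    return db_size * total


def final_fp_bound(fp_rates: Sequence[float]) -> float:
    """Upper bound on the final FP rate: the product of all fp_i."""
    bound = 1.0
    for fp in fp_rates:
        bound *= fp
    return bound


# Two hypothetical substructures: a fast, selective one and a slow, permissive one.
fast_first = progressive_search_time(1000, [1.0, 2.0], [0.1, 0.5])
slow_first = progressive_search_time(1000, [2.0, 1.0], [0.5, 0.1])
```

In this toy setting, applying the fast and selective substructure first gives the smaller T, while the bound on the final FP rate is the same for both orders.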
We split the substructures into a short group and a long group, which contain short and long substructures, respectively. For each group, we sort the substructures according to their E-values and apply a greedy algorithm to choose a set of substructures for search. The main steps of the greedy algorithm are outlined below, starting from the short group:
1. In each iteration, choose the substructure with the smallest E-value. Remove it from its group and append it to the final substructure list $\mathcal{S}\mathcal{U}\mathcal{B}$.
2. Remove from both groups any remaining substructure that only contains stems in this substructure.
3. Repeat the first step until every stem is covered by at least one chosen substructure or the E-values of all remaining substructures are bigger than a predetermined cutoff (default is 1).
If $\mathcal{S}\mathcal{U}\mathcal{B}$ does not yet include all stems, we apply the same process to the long group and append the chosen substructures to $\mathcal{S}\mathcal{U}\mathcal{B}$. We require all stems to be covered by the chosen substructures in order to ensure that the annotated pseudoknot structure is represented. It is possible that this constraint will exclude homologous ncRNAs that lack annotated stem-loop structures. Currently, we use a size of 150 as the threshold to divide substructures into the short and the long group.
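The greedy selection can be sketched in Python as follows. This is an illustration under several assumptions that the text leaves open: substructures shorter than the length threshold go to the short group, step 2 is read as removing any candidate whose stems are a subset of a single chosen substructure, and the same E-value cutoff is applied to the long group. The `Substructure` class, `choose_substructures` function, and the toy candidates are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Substructure:
    name: str
    stems: FrozenSet[int]
    length: int
    evalue: float


def choose_substructures(candidates: List[Substructure], all_stems: Set[int],
                         length_cutoff: int = 150,
                         evalue_cutoff: float = 1.0) -> List[Substructure]:
    """Greedy choice: short group first, then the long group if stems remain."""
    short = [s for s in candidates if s.length < length_cutoff]
    long_ = [s for s in candidates if s.length >= length_cutoff]
    chosen: List[Substructure] = []
    covered: Set[int] = set()
    for group in (short, long_):
        pool = sorted(group, key=lambda s: s.evalue)
        while covered != all_stems:
            # Step 2: drop candidates whose stems all lie in one chosen substructure.
            pool = [s for s in pool if not any(s.stems <= c.stems for c in chosen)]
            # Step 3: stop when nothing is left or the best E-value exceeds the cutoff.
            if not pool or pool[0].evalue > evalue_cutoff:
                break
            best = pool.pop(0)  # Step 1: smallest E-value first
            chosen.append(best)
            covered |= best.stems
    return chosen


candidates = [
    Substructure("A", frozenset({1}), 50, 1e-3),
    Substructure("B", frozenset({2}), 60, 1e-5),
    Substructure("C", frozenset({2}), 40, 5.0),
    Substructure("D", frozenset({1, 3}), 200, 1e-9),
    Substructure("E", frozenset({3}), 180, 0.5),
]
chosen = choose_substructures(candidates, all_stems={1, 2, 3})
```

In the toy run, B and A are taken from the short group, C is discarded because its only stem is already in B, and D from the long group covers the remaining stem, so E is never needed.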
Implementation
For each substructure, we train an SCFG-based model on the corresponding alignment in the training data using Infernal. Let the SCFGs trained from the n substructures of an ncRNA family $\mathcal{S}\mathcal{U}\mathcal{B}=\left({\mathcal{H}}_{1},...,{\mathcal{H}}_{i},...,{\mathcal{H}}_{n}\right)$ be $\Pi =\left({\mathcal{M}}_{1},...,{\mathcal{M}}_{i},...,{\mathcal{M}}_{n}\right)$, where ${\mathcal{M}}_{i}$ represents a single SCFG. A sequence can be classified into the corresponding family if the following conditions are satisfied. First, the sequence contains matches to all designed SCFGs in Π. An SCFG match is defined in the following text. Second, for every pair of strings that match two SCFGs, their position relationship must be consistent with the annotated relationship between the two SCFGs in the underlying ncRNA family. There are three types of position relationship between two substructures: parallel, nested, and crossover. Crossover indicates the existence of pseudoknots.
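The second condition can be illustrated with a small Python sketch. It assumes, as a simplification, that each SCFG match is reported as an inclusive (start, end) interval on the sequence, so that disjoint intervals are parallel, containment is nested, and partial overlap is crossover; flanking regions added by Rule 3 could blur this in practice. The function names and the `annotated` dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) of a match; an assumption


def relationship(a: Interval, b: Interval) -> str:
    """Classify two match intervals as parallel, nested, or crossover."""
    if a[1] < b[0] or b[1] < a[0]:
        return "parallel"
    if (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1]):
        return "nested"
    return "crossover"


def consistent(hits: List[Interval],
               annotated: Dict[Tuple[int, int], str]) -> bool:
    """True if every pair of hits has the annotated position relationship."""
    return all(relationship(hits[i], hits[j]) == expected
               for (i, j), expected in annotated.items())


# Hypothetical family with three substructures: 0 and 1 cross, 2 is downstream.
annotated = {(0, 1): "crossover", (0, 2): "parallel", (1, 2): "parallel"}
good_hits = [(1, 20), (10, 30), (40, 60)]
bad_hits = [(1, 20), (25, 30), (40, 60)]
```

Here `good_hits` reproduces the annotated crossover between the first two substructures, whereas `bad_hits` places them side by side and would be rejected.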
We determine an SCFG match using score thresholds. For each sequence in the training set, its alignment score with a given SCFG is computed. The minimum score over all the seed sequences is used as the score threshold. This score cutoff is similar to the TC (trusted cutoff) bit score thresholds used in HMMER [23] or Infernal. When the training data contains a good representation of the family member sequences, the computed score threshold can ensure a high sensitivity during homology search. If the training set only contains close homologs of this ncRNA family, the designed cutoff may be too high for remotely related homologs.
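The thresholding rule is simple enough to state in a few lines of Python. This is a minimal sketch: the seed scores below are made-up numbers, and the function names are illustrative rather than part of Infernal or of the authors' tool.

```python
from typing import Iterable


def score_threshold(seed_scores: Iterable[float]) -> float:
    """Threshold for one SCFG: the lowest alignment score among the seeds."""
    return min(seed_scores)


def is_match(score: float, threshold: float) -> bool:
    """A candidate matches the SCFG if it scores at least the threshold."""
    return score >= threshold


# Hypothetical bit scores of the training (seed) sequences against one SCFG.
seed_scores = [35.2, 41.0, 28.7]
threshold = score_threshold(seed_scores)
```

By construction every seed sequence passes its own threshold, which is why sensitivity on the training family is high while remote homologs scoring below the weakest seed are missed.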
Experimental results
In order to test the performance of our tool for pseudoknot search in sequence databases, we conducted two experiments. First, we examined the automatically classified pseudoknot sequences in Rfam. Second, we applied it to part of the Maize genome. On the same data set, we compared our tool with RNAv, RNATOPS, and Infernal.
Pseudoknot sequences in Rfam
Sequences that do not contain annotated pseudoknots and thus may not be real members.

| Family ID | Seqs without knots / number of seqs |
|---|---|
| RF00009 | 37/500 |
| RF00010 | 3/3864 |
| RF00011 | 26/460 |
| RF00023 | 53/2871 |
| RF00024 | 56/233 |
| RF00028 | 2587/39045 |
| RF00030 | 47/470 |
| RF00041 | 2/151 |
| RF00140 | 81/524 |
| RF00176 | 37/64 |
| RF00216 | 25/126 |
| RF00233 | 22/76 |
| RF00259 | 78/124 |
| RF00261 | 43/78 |
| RF00499 | 1/16 |
| RF00523 | 2/5177 |
| RF00622 | 1/94 |
| RF01050 | 3/60 |
| RF01072 | 21/271 |
| RF01073 | 1/7006 |
| RF01086 | 15/1093 |
| RF01087 | 1/31 |
| RF01089 | 4/25 |
| RF01096 | 2/45 |
Data set preparation
We created a simulated data set based on a contiguous 22-Mb region of the Maize Genome [25]. The annotation of the 22-Mb region does not contain any hit to the 71 pseudoknot families in Rfam. In order to evaluate the sensitivity of pseudoknot search tools, we randomly chose 1,586 out of 26,704 seed sequences from the 71 pseudoknot families and inserted them into the 22-Mb region. The remaining seed sequences are used as the training data. In order to examine the FP rate of SCFG-based tools, we also created 1,586 sequences without pseudoknots. Specifically, for each of the 1,586 seed sequences, we altered the bases to disrupt the base pairs that can form pseudoknots. Similarly, RNAmotif [24] is applied again to ensure that these sequences lose the annotated pseudoknot structure. These modified 1,586 sequences and the original 22-Mb region of the Maize Genome constitute the negative training data. Any hit to them is an FP hit. Note that by changing the bases, the modified sequences might share lower sequence similarity with the trained model and thus pose an easier case for all tools. Even so, our experimental results still show that different tools exhibit highly different performance on this data set. Thus, we feel this data set is a reasonable test set.
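One way to alter bases so that annotated pseudoknot base pairs can no longer form is sketched below in Python. The paper does not specify the exact mutation rule, so this is only an assumed procedure: for each listed pair, the 3' base is replaced by the first base in "ACGU" that cannot form a Watson-Crick or G-U wobble pair with the 5' base. Indices are 0-based, the sequence is assumed to use the RNA alphabet, and all names are hypothetical.

```python
from typing import List, Tuple

# Watson-Crick and G-U wobble pairs that can stabilize a stem.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def disrupt_pairs(seq: str, pairs: List[Tuple[int, int]]) -> str:
    """Mutate the 3' partner of each listed pair so the pair cannot form."""
    bases = list(seq.upper())
    for i, j in pairs:
        if (bases[i], bases[j]) in PAIRS:
            # First base that cannot pair with the 5' partner.
            bases[j] = next(b for b in "ACGU" if (bases[i], b) not in PAIRS)
    return "".join(bases)


def remaining_pairs(seq: str, pairs: List[Tuple[int, int]]) -> int:
    """Count how many of the listed positions can still base-pair."""
    return sum((seq[i], seq[j]) in PAIRS for i, j in pairs)


# Toy hairpin standing in for the stems of a pseudoknot.
knotted = "GGGAAACCC"
knot_pairs = [(0, 8), (1, 7), (2, 6)]
unknotted = disrupt_pairs(knotted, knot_pairs)
```

A substitution chosen this way may still create new pairings elsewhere in the sequence, which is why a separate structural check, such as the RNAmotif step described above, remains necessary.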
There are two major advantages of using this simulated data set for testing pseudoknot search tools. First, as the 22-Mb region of the Maize genome does not harbor any reported ncRNA that contains pseudoknots, we can measure the empirical FP rates of pseudoknot search tools with higher reliability than with simulated sequences, which are usually generated using a simple i.i.d. model or a low-order Markov model. In particular, the Maize genome contains a high percentage of repeats and low-complexity regions, which cannot be simply simulated and can pose a challenge for ncRNA search, as warned by the Rfam website (http://rfam.sanger.ac.uk/). Second, using thousands of seed members of the pseudoknot families provides us with adequate test data for evaluating the sensitivity.
Besides using the seed sequences of Rfam, we also considered another pseudoknot sequence database Pseudobase [26]. This database contains 304 RNA sequences with pseudoknot structures. A majority of them are substrings of Rfam seed sequences. Thus, we choose to use Rfam seed sequences as the true label.
Results and comparisons
We report the FP hits instead of the FP rates for two reasons. First, the condition negative set is family specific and thus is the same for all search tools for a given family. Second, the size of the condition negative set is mainly determined by the size of the genome minus the size of all true pseudoknot sequences. For a large genomic sequence, the FP rate becomes very small and cannot reflect the difference between different tools.
On the same dataset, we ran RNAv, RNATOPS, and Infernal 1.0.2. Of the three, RNAv and RNATOPS are designed for pseudoknot search. For Infernal and the substructure-based tool, no hidden Markov model-based filtration was used, in order to maximize the sensitivity. Other parameters were set to their defaults for Infernal. We used the default parameters to run RNAv and RNATOPS. All experiments were run on the main cluster of the High Performance Computing Center on campus (http://www.icer.msu.edu/?q=hpcc). Each experiment was allocated four CPU days at most. RNAv and RNATOPS failed on 65 and 31 families, respectively. The search jobs for those families were killed by the cluster after four CPU days, and no results were produced. Thus, we could not report the results for those families. RNATOPS output results for 22 families by the end of the allocated time.
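Sensitivity (sen), FP hits (FP), and running time (hh:mm:ss) comparison between RNAv, RNATOPS, the substructure-based tool, and Infernal. A dash marks a tool for which no result is reported for that family.

| RNA fam ID | RNAv sen | RNAv FP | RNAv time | RNATOPS sen | RNATOPS FP | RNATOPS time | Substructure sen | Substructure FP | Substructure time | Infernal sen | Infernal FP | Infernal time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF00009 | - | - | - | - | - | - | 1.0 | 5 | 01:47:37 | 1.0 | 38 | 26:16:07 |
| RF00010 | - | - | - | - | - | - | 0.58 | 95 | 00:18:47 | 0.97 | 318 | 17:54:31 |
| RF00011 | - | - | - | - | - | - | 0.84 | 25 | 00:06:51 | 0.97 | 179 | 09:09:52 |
| RF00023 | - | - | - | - | - | - | 0.4 | 1 | 00:06:54 | 1.0 | 180 | 13:40:31 |
| RF00024 | - | - | - | - | - | - | 0.95 | 24 | 00:06:33 | 0.81 | 86 | 20:36:42 |
| RF00028 | - | - | - | - | - | - | 0.83 | 6 | 22:30:56 | 0.72 | 37 | 79:05:16 |
| RF00030 | - | - | - | - | - | - | 0.38 | 26 | 02:35:01 | 0.98 | 87 | 83:37:31 |
| RF00041 | - | - | - | - | - | - | 0.95 | 0 | 00:10:37 | 1.0 | 64 | 01:27:52 |
| RF00094 | - | - | - | - | - | - | 0.88 | 0 | 00:09:21 | 1.0 | 35 | 00:54:20 |
| RF00140 | - | - | - | - | - | - | 0.97 | 0 | 01:05:08 | 1.0 | 33 | 01:52:09 |
| RF00165 | - | - | - | 0.21 | 4 | 4 days | 1.0 | 0 | 00:22:10 | 1.0 | 14 | 00:32:25 |
| RF00176 | 1.0 | 58077 | 19:54:40 | - | - | - | 1.0 | 0 | 00:05:48 | 1.0 | 21 | 00:50:54 |
| RF00216 | - | - | - | - | - | - | 0.87 | 0 | 00:03:03 | 1.0 | 30 | 04:42:29 |
| RF00233 | - | - | - | 0.26 | 0 | 4 days | 0.96 | 0 | 00:09:06 | 1.0 | 29 | 00:47:38 |
| RF00259 | - | - | - | - | - | - | 1.0 | 0 | 00:05:41 | 1.0 | 5 | 02:09:52 |
| RF00261 | - | - | - | - | - | - | 1.0 | 0 | 00:13:53 | 1.0 | 20 | 02:50:11 |
| RF00373 | - | - | - | - | - | - | 0.92 | 27 | 01:35:35 | 0.95 | 363 | 14:15:43 |
| RF00381 | - | - | - | 0.38 | 30 | 4 days | 1.0 | 0 | 00:17:10 | 1.0 | 15 | 00:33:42 |
| RF00390 | - | - | - | 1.0 | 763 | 4 days | 1.0 | 0 | 00:05:21 | 1.0 | 6 | 00:07:35 |
| RF00458 | - | - | - | - | - | - | 1.0 | 0 | 00:09:37 | 1.0 | 10 | 02:18:47 |
| RF00499 | - | - | - | - | - | - | 1.0 | 0 | 00:09:51 | 1.0 | 115 | 01:33:43 |
| RF00505 | - | - | - | 0.2 | 2 | 4 days | 1.0 | 0 | 00:32:27 | 1.0 | 5 | 00:29:55 |
| RF00507 | - | - | - | 0.41 | 7 | 4 days | 0.95 | 0 | 00:34:44 | 1.0 | 23 | 00:52:44 |
| RF00523 | - | - | - | 0.29 | 160 | 4 days | 0.95 | 24 | 00:20:31 | 1.0 | 145 | 00:19:24 |
| RF00622 | - | - | - | - | - | - | 1.0 | 0 | 00:05:15 | 1.0 | 14 | 00:42:40 |
| RF01050 | - | - | - | - | - | - | 1.0 | 0 | 00:41:32 | 1.0 | 13 | 39:22:21 |
| RF01072 | - | - | - | 0.52 | 273 | 4 days | 0.96 | 0 | 00:08:37 | 1.0 | 30 | 00:10:13 |
| RF01073 | 1.0 | 196631 | 13:13:32 | 0.11 | 3 | 4 days | 1.0 | 0 | 00:18:36 | 1.0 | 13 | 00:29:04 |
| RF01074 | - | - | - | 0.5 | 91 | 4 days | 1.0 | 0 | 00:06:59 | 1.0 | 10 | 00:15:00 |
| RF01075 | - | - | - | - | - | - | 1.0 | 0 | 00:07:59 | 1.0 | 7 | 01:00:46 |
| RF01076 | 1.0 | 139249 | 16:20:29 | - | - | - | 1.0 | 0 | 00:20:36 | 1.0 | 5 | 00:35:33 |
| RF01077 | - | - | - | - | - | - | 1.0 | 0 | 00:53:32 | 1.0 | 4 | 00:37:23 |
| RF01078 | - | - | - | - | - | - | 1.0 | 0 | 00:12:38 | 1.0 | 3 | 00:26:04 |
| RF01079 | - | - | - | 1.0 | 333 | 4 days | 1.0 | 0 | 00:07:37 | 1.0 | 3 | 00:16:01 |
| RF01080 | - | - | - | 0.5 | 135 | 4 days | 1.0 | 0 | 00:08:04 | 1.0 | 110 | 00:13:41 |
| RF01081 | - | - | - | 0.67 | 284 | 4 days | 1.0 | 0 | 00:06:47 | 1.0 | 3 | 00:08:44 |
| RF01082 | - | - | - | 0.5 | 2934 | 4 days | 1.0 | 0 | 00:05:47 | 1.0 | 4 | 00:09:13 |
| RF01083 | - | - | - | 1.0 | 3002 | 4 days | 0.67 | 1 | 00:04:34 | 1.0 | 7 | 00:07:05 |
| RF01084 | - | - | - | - | - | - | 1.0 | 0 | 00:10:46 | 1.0 | 8 | 01:53:25 |
| RF01086 | - | - | - | - | - | - | 1.0 | 11 | 05:18:38 | 1.0 | 13 | 05:39:23 |
| RF01087 | - | - | - | 0.5 | 3 | 4 days | 1.0 | 0 | 01:19:41 | 1.0 | 12 | 01:37:01 |
| RF01088 | - | - | - | - | - | - | 1.0 | 0 | 00:39:09 | 1.0 | 4 | 00:37:14 |
| RF01089 | - | - | - | 0.33 | 1 | 4 days | 1.0 | 3 | 01:03:09 | 1.0 | 20 | 01:21:27 |
| RF01090 | - | - | - | 0.43 | 4 | 4 days | 1.0 | 0 | 00:23:11 | 1.0 | 8 | 00:36:35 |
| RF01091 | - | - | - | - | - | - | 1.0 | 0 | 00:13:06 | 1.0 | 4 | 00:28:51 |
| RF01092 | 1.0 | 165990 | 10:58:02 | 1.0 | 0 | 4 days | 1.0 | 0 | 00:17:57 | 1.0 | 15 | 00:30:08 |
| RF01093 | - | - | - | 0.42 | 67 | 4 days | 1.0 | 0 | 00:13:59 | 1.0 | 23 | 00:29:56 |
| RF01094 | - | - | - | - | - | - | 1.0 | 0 | 00:52:46 | 1.0 | 3 | 01:10:47 |
| RF01095 | - | - | - | - | - | - | 1.0 | 0 | 00:10:56 | 1.0 | 2 | 00:27:12 |
| RF01096 | 1.0 | 166314 | 16:44:20 | 0.5 | 1 | 4 days | 1.0 | 0 | 00:23:04 | 1.0 | 2 | 00:24:45 |
| RF01097 | - | - | - | 0.25 | 1 | 4 days | 1.0 | 0 | 00:12:18 | 1.0 | 4 | 00:22:09 |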
There is no significant difference in sensitivity between Infernal and the substructure-based tool when the average sequence length in a family is not too long. Infernal has better sensitivity on longer and more complicated RNA families, including RF00010, RF00011, RF00023, and RF00030. The major reason behind our lower sensitivity on the long families is that we use substructures that cover every stem. Thus, we only classify sequences that have all characterized stems from the underlying structure. However, some remote homologs may lose base pairs in stems during evolution. Thus, while we are guaranteed to find sequences that have the same structures as the annotated pseudoknots, we can miss some homologs, leading to lower sensitivity for some families.
Conclusion
Although Infernal is highly sensitive in known ncRNA search, caution must be taken when applying Infernal to ncRNA families containing pseudoknots. In this work, we designed a pseudoknot search method based on a set of carefully chosen substructures. These substructures do not contain pseudoknots or bifurcations. SCFGs can be conveniently built on them and searched with high efficiency. In order to minimize the overall FP rate and the running time, we sorted substructures according to their lengths and their E-values for the designed trusted cutoff (TC) bit score thresholds. We designed a greedy algorithm to choose a set of substructures and applied the progressive search to minimize search time. Our experimental results showed that our tool competes favorably with RNAv and RNATOPS, both of which have been used for pseudoknot search in large databases. This work provides a pseudoknot search tool that is complementary to existing SCFG-based knot-free ncRNA search methods.
Currently, our tool only reports homologous ncRNAs with the same number of characterized stems as the training data. As a result, some true homologs that have lost one or multiple stems will be ignored. As part of the future work, we plan to incorporate available RNA-seq data for remote homology search.
Declarations
The publication costs for this article were funded by NSF DBI-0953738 and IOS-1126998.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
Acknowledgements
This work was partially supported by the NSF grants DBI-0953738 and IOS-1126998.
References
1. Griffiths-Jones S: Annotating Noncoding RNA Genes. Annual Review of Genomics and Human Genetics. 2007, 8: 279-298. 10.1146/annurev.genom.8.080706.092419.
2. Staple DW, Butcher SE: Pseudoknots: RNA Structures with Diverse Functions. PLoS Biology. 2005, 3 (6): e213. 10.1371/journal.pbio.0030213.
3. Gilley D, Blackburn EH: The telomerase RNA pseudoknot is critical for the stable assembly of a catalytically active ribonucleoprotein. PNAS. 1999, 96 (12): 6621-6625. 10.1073/pnas.96.12.6621.
4. Chen JL, Greider CW: Functional analysis of the pseudoknot structure in human telomerase RNA. PNAS. 2005, 102 (23): 8080-8085. 10.1073/pnas.0502259102.
5. Wower IK, Zwieb C, Wower J: Contributions of Pseudoknots and Protein SmpB to the Structure and Function of tmRNA in trans-Translation. The Journal of Biological Chemistry. 2004, 279 (52): 54202-54209. 10.1074/jbc.M410488200.
6. Gardner P, Daub J, Tate J, Nawrocki E, Kolbe D, Lindgreen S, Wilkinson A, Finn R, Griffiths-Jones S, Eddy S, Bateman A: Rfam: updates to the RNA families database. Nucleic Acids Research. 2008, 37 (Database): D136-D140.
7. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: Inference of RNA alignments. Bioinformatics. 2009, 25: 1335-1337. 10.1093/bioinformatics/btp157.
8. Klein RJ, Eddy SR: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 44. 10.1186/1471-2105-4-44.
9. Lowe T, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25: 955-64.
10. Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, UK: Cambridge University Press.
11. Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics. 2003, 19 (Suppl. 1): i66-i73.
12. Rivas E, Eddy SR: The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics. 2000, 16 (4): 334-340. 10.1093/bioinformatics/16.4.334.
13. Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science. 1999, 210 (2): 277-303. 10.1016/S0304-3975(98)00090-5.
14. Matsui H, Sato K, Sakakibara Y: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics. 2005, 21 (11): 2611-2617. 10.1093/bioinformatics/bti385.
15. Kato Y, Seki H, Kasami T: RNA Pseudoknotted Structure Prediction Using Stochastic Multiple Context-Free Grammar. IPSJ Digital Courier. 2006, 2: 655-664.
16. Smith JA: RNA Search with Decision Trees and Partial Covariance Models. IEEE/ACM Trans Comput Biol Bioinform. 2009, 6 (3): 517-527.
17. Brown M, Wilson C: RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. Pac Symp Biocomput. 1996, 109-125.
18. Zhang S, Haas B, Eskin E, Bafna V: Searching Genomes for Noncoding RNA Using FastR. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2: 366-79. 10.1109/TCBB.2005.57.
19. Weinberg Z, Ruzzo W: Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics. 2004, 20 (suppl. 1): i334-40.
20. Huang Z, Malmberg R, Mohebbi M, Cai L: RNAv: Non-coding RNA secondary structure variation search via graph homomorphism. CSB Conference Proceedings. 2010, CA, USA, 56-69.
21. Huang Z, Wu Y, Robertson J, Feng L, Malmberg RL, Cai L: Fast and accurate search for non-coding RNA pseudoknot structures in genomes. Bioinformatics. 2008, 24 (20): 2281-2287. 10.1093/bioinformatics/btn393.
22. Zhang S, Borovok I, Aharonowitz Y, Sharan R, Bafna V: A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements. Bioinformatics. 2006, 22: e557-65. 10.1093/bioinformatics/btl232.
23. Eddy S: HMMER - biosequence analysis using profile hidden Markov models. 2007, [http://hmmer.janelia.org/]
24. Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif - A new RNA secondary structure definition and discovery algorithm. Nucleic Acids Research. 2001, 29: 4724-4735. 10.1093/nar/29.22.4724.
25. Wei F, Stein JC, Liang C: Detailed Analysis of a Contiguous 22-Mb Region of the Maize Genome. PLoS Genet. 2009, 5 (11): e1000728. 10.1371/journal.pgen.1000728.
26. van Batenburg FHD, Gultyaev AP, Pleij CWA, Ng J, Oliehoek J: PseudoBase: a database with RNA pseudoknots. Nucleic Acids Research. 2000, 28: 201-204. 10.1093/nar/28.1.201.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.