Shape based indexing for faster search of RNA family databases
 Stefan Janssen^{1},
 Jens Reeder^{1} and
 Robert Giegerich^{1}Email author
https://doi.org/10.1186/147121059131
© Janssen et al; licensee BioMed Central Ltd. 2008
Received: 30 October 2007
Accepted: 29 February 2008
Published: 29 February 2008
Abstract
Background
Most noncoding RNA families exert their function by means of a conserved, common secondary structure. The Rfam data base contains more than five hundred structurally annotated RNA families. Unfortunately, searching for new family members using covariance models (CMs) is very time consuming. Filtering approaches that use the sequence conservation to reduce the number of CM searches, are fast, but it is unknown to which sacrifice.
Results
We present a new filtering approach, which exploits the family specific secondary structure and significantly reduces the number of CM searches. The filter eliminates approximately 85% of the queries and discards only 2.6% true positives when evaluating Rfam against itself. First results also capture previously undetected noncoding RNAs in a recent human RNAz screen.
Conclusion
The RNA s hape i ndex f ilter (RNAsifter) is based on the following rationale: An RNA family is characterised by structure, much more succinctly than by sequence content. Structures of individual family members, which naturally have different length and sequence composition, may exhibit structural variation in detail, but overall, they have a common shape in a more abstract sense. Given a fixed release of the Rfam data base, we can compute these abstract shapes for all families. This is called a shape index. If a query sequence belongs to a certain family, it must be able to fold into the family shape with reasonable free energy. Therefore, rather than matching the query against all families in the data base, we can first (and quickly) compute its feasible shape(s), and use the shape index to access only those families where a good match is possible due to a common shape with the query.
Keywords
Background
ncRNA
Computational screens [1–3] predict thousands of potentially conserved secondary structures in the human genome. Similar screens in Yeast and Nematodes [4] also produced thousands of potential noncoding RNAs (ncRNAs). Some were already identified by sequence comparison as members of the known RNA families, but the meaning of the majority remains unclear. With functional RNA, structure is often more important than primary sequence content. Thus, a BLAST screen against RNA sequence databases is not enough – the structure needs to be taken into account when searching for known relatives of a query RNA.
Rfam and its usage
The Rfam database [5] is a constantly growing data source for ncRNAs. The current release (Rfam 8.0, February 2007. During the submission process a new version of Rfam (8.1, 607 families) appeared.) contains 273 989 annotated sequences grouped in 574 families. An Rfam family contains two secondary structure annotated, multiple alignments, a so called seed alignment and a larger full alignment. A consensus structure is also provided (annotated as SS_cons), with no guarantee that family members actually fold into this consensus. The seed alignment usually is handcurated and contains only validated sequences from the literature and other databases. From the seed alignment, a probabilistic model is learned, which is then used to annotate new members of this family. All members with a good CM (covariance model) score build the full alignment.
Researchers today use the information stored in Rfam routinely to annotate newly sequenced bacterial genomes, and with caution also for more complex eukaryotic genomes. The major obstacle researchers encounter hereby is the high computational complexity, which makes the annotation process very time consuming or even impossible. Filtering techniques are therefore necessary to speed up the analysis. In the next sections we will shortly review the probabilistic models used for searching and two sequence based filtering techniques.
Searching with covariance models
Covariance models (CMs) [6] are probabilistic models, incorporating family specific structural information, much like profile hidden Markov models (HMM) in (linear) sequence analysis do. The main advantage of a CM over a HMM is, that it can model the long range interactions we see with RNA base pairing.
CMs basically are profile stochastic context free grammars (SCFG). Each base pair and each unpaired residue is represented by one state. States are arranged in a treelike structure that mirrors the treelike consensus structure of an RNA family. Additional states model insertions and deletions of bases, differing from the consensus. Transitions from one state to another in a CM are modeled by production rules, each having certain transition and emission probabilities, learned from a multiple structural alignment. Given a CM and a query sequence, the algorithmic problem is to find the path through the CM that emits the sequence with the highest probability. This can be done efficiently using dynamic programming with a CYKlike parsing algorithm in O(LN^{3}), where L is the target sequence length and N the window size. The programs to build and search with CMs are bundled in the software package Infernal.
Despite some recent improvements [7], many Rfam families still require more than an hour CPU time per Mb. Consequently, a large eukaryotic genome cannot be annotated in total within reasonable time. The search procedure has to be restricted to the most promising regions. This is where the need for efficient filters arises.
Filtering
Basic concepts of filtering
Usually, filters strive to rule out as much of the input data as possible, without discarding too many positive cases. Of course, these are two competing goals and in practice a tradeoff has to be made. We define the ratio of the input data that passes the filter and the total input data as the filtration ratio. A good filter has a low filtration ratio, a useless one a ratio of 1. The sensitivity is defined as the ratio of positives passing the filter and all positives. A special case are rigorous filters, which have a perfect sensitivity. They never filter out a true positive.
BLASTfilter
The curators of Rfam provide a simple and fast BLAST based filtering heuristic [8]: All BLAST hits with a Pvalue < 10 to any member of the seed alignments are extracted and a family specific window size is added to both ends of the hit. Only the much smaller subsequences are then analyzed with an expensive CM search. Despite its simplicity, the filter has been used to add sequences to Rfam. For each new Rfam release, a portion of the EMBL nucleotide database is scanned this way. The resulting CM hits, together with the seed members, then constitute the full alignment.
The more diverse in sequence the seed alignment is, the more likely distantly related family members will be recognized. Also, with each new family member the filter becomes more sensitive, yet less specific. However by construction, an RNA perfectly in agreement with the consensus structure but very dissimilar in sequence will be discarded. At the moment, it is unclear how many homologues are hidden in the databases, overlooked by the BLAST filter.
HMMfilter
Recently, another filtering approach was introduced by Weinberg and Ruzzo. They suggest profile HMMs, automatically built from the CM, as a prefilter to the CM search. In [9], the authors demonstrate how to convert a CM into a rigorous HMM filter. First, the CM has to be linearized. For a base pair state s in the CM, emitting columns i and j in the multiple alignment, two states in the HMM are introduced, namely at position i and j. Now, assume the CM emits only CG and GC base pairs in state s, then the HMM would emit a C or G at position i and another C or G at position j. Of course the base pair condition cannot be modeled this way, but nevertheless the sequence information is still available. Second, the scores of the HMM are cleverly chosen, such that the HMM Viterbi score is an upper bound for the CM score. Thus any subsequence scoring below a certain threshold can be safely discarded. For many Rfam families the filtration ratio is < 0.01, thus making the HMM scan itself the run time determining step. The HMM searches scan the database approximately 200 times faster than the original CM searches.
However, for some families the rigorousness requirement prevents a significant speed up of the resulting HMM over the CM. Also the filtering efficiency may suffer from the attempt to capture even the most excentric family member. A way out of this dilemma are heuristic filters (called MaximumLikelihood (ML) heuristics in [10]), which sacrifice rigorousness for speed. The ROClike curves in [10] and the pink data point in the lower right of Figure 1 give an impression of this fact. The current Infernal release 0.81 provides a variant of the MLheuristic.
A new approach: shape based filtering
Our idea of RNA s hape i ndex f iltering (RNAsifter) is based on the following rationale: An RNA family is characterised by structure, much more succinctly than by sequence content. Structures of individual family members, which naturally have different length and sequence composition, may exhibit structural variation in detail, but overall, they have a common shape in a more abstract sense. Given a fixed release of the database, we can compute these abstract shapes for all families. This is called a shape index.
If a query sequence belongs to a certain family, it must be able to fold into the family shape with reasonable free energy. Therefore, rather than matching the query against all families in the data base, we can first (and quickly) compute its feasible shape(s), and use the shape index to access only the families where a match is possible due to a common shape with the query.
In an ideal world, this results in a rigorous filtering algorithm (rather than a heuristics): It reduces the number of searches (and therefore computation time), while no potential match is missed. In reality, the outcome depends on many details – the structural homogeneity of families, the type of shapes we compute, and so forth. Moreover, the general idea gives ample room for alternative implementations. We will explore some of these in the sequel, and end up with a quite effective (although not perfect) parameter set that finds 97.4% of all hits performing only 15.0% of the queries.
A review of abstract shapes of RNA
Abstract shapes of RNA were introduced in [11]. We give a short review, avoiding a fully formal treatment. When we speak about functional RNA classes, we do not refer to concrete structures – we employ abstraction. A tRNA has a cloverleaf structure, a microRNA precursor is a lengthy hairpin, oxyS RNA has three adjacent hairpins. Obviously, the most important structural characteristic is the specific arrangement of RNA helices, governed by the two principles of adjacency and embedding. The cloverleaf, for example, is a helix which embeds three helices adjacent to each other. Sometimes, we want to be less abstract. The iron responsive element, for example, is a small hairpin with a bulgedout cytosin that is essential. The technique of abstract shape analysis [11] formalises the concept of shapes and teaches RNA folding programs to compute with these shapes. This is done in a mathematically precise sense, with no heuristics involved. The program RNAshapes computes the k ≥ 1 nearoptimal structures which have different shapes, thus giving a concise overview over a molecule's structural inclinations. These structures are called sh ape rep resentatives, shrep for short, as each is an optimal structure with respect to its shape.
Probabilistic shape analysis [12] computes Boltzmann statistics shapewise, giving us the accumulated probability of all ways in which the given sequence can fold into (say) a cloverleaf shape. Being a cloverleaf shape with (say) 80% probability is much more handsome information than traditional MFE folding, as this is independent of sequence length and composition, and hence comparable between different sequences. Formally, a shape abstraction is a mapping from concrete RNA structures to abstract shapes. Concrete structures are modeled as trees, as frequently done in the RNA bioinformatics. This is natural, as trees incorporate the two principles of adjacency (among sibling nodes) and embedding (from a parent node to its children). Abstract shapes, then, are also trees, but containing less detail. Any mapping from structures to shapes that is a tree homomorphism, i.e. preserves adjacency and embedding, can be used as shape abstraction. It is our decision which structural feature is to be retained and what is to be abstracted from.
In [11], five abstraction functions π ∈ {π_{1}, ... π_{5}} (that produce shapes of levels 1 through 5) were introduced. They all abstract from the length of helices and unpaired regions, but are more or less forgetful about the presence of structural features like bulges and internal loops.
These five abstractions are the ones also used within the present approach. Here we skip their formal definitions, and rather explain them by example. In doing so, we use a string representation of shapes akin to the dotbracket strings that commonly encode concrete structures. We use square brackets to denote helices (or helix parts) and underscores for unpaired regions. Their precise meaning, however, depends on the shape level used.
Note that the level5shape [] comprises all structures with level3shape [], [[]], [[[]]], and so on – this is because on level 5, helix interruptions are not accounted for at all, while on level 3, each interruption of a helix by a bulge or internal loop implies an extra helix part recorded by an extra pair of square brackets.
Shape levels 1, ..., 5 are designed to form a strict hierarchy: If π_{ i }(x) = π_{ i }(y), then also π_{ j }(x) = π_{ j }(y) for j > i. (In fact, we found an error in this respect in the original implementation of π_{2} and π_{4}, and corrected it in the course of this study. The hierarchy property is not strictly necessary for our filtering purpose here, but mathematically pleasing and useful in other applications of abstract shape analysis.)
Methods
Throughout this section, we use Rfam as "the" RNA family database. Note, however, that our filtering technique applies to any database that groups RNA sequences into structurally related families.
Shapebased indexing framework
 1.
For each family f ∈ Rfam, we compute a family shape spectrum fss(f).
 2.
{fss(f)  f ∈ Rfam} is converted into an index data structure I_{Rfam} such as a hash table or a suffix tree, with search access time independent of the size of I_{Rfam} and hence of Rfam size.
 3.
For a given query sequence x, we compute a query shape spectrum qss(x).
 4.
We access the index I_{Rfam} to determine the match set M(x) = {f  qss(x) ⋂ fss(f) ≠ ∅}.
 5.
If M(x) = = ∅, the query ends with a negative result, without access to Rfam. Otherwise, we execute cmsearch_{ f }(x) for each f ∈ M(x), which determines the outcome of the query.
Index construction (steps 1 and 2) must only be performed once for each release of the data base. The matching against the index (step 4) requires an exact match of a shape in qss(x) to a shape in some fss(f). This makes access to the index so fast that its execution time is negligible compared to a call to cmsearch. The shape spectra qss(x) and fss(f) can be computed in many different ways, and can be combined (almost) arbitrarily in indexing. We describe several variants and finally report which combination of ideas has worked best after extensive evaluations.
Alternative shapes of a sequence, as computed by RNAshapes, are ranked according to the free energy of their respective shreps. We denote by RNAshapes(k, π, x) the computation of the k topranked shapes of x under the shape abstraction function π, where π ∈ {π_{1}, ... π_{5}}. k = 0 means we compute all shapes for x. By π(s), we denote the shape of structure s. RNAfold_{C}(a, x) [13] denotes the minimum free energy structure of x under the constraint that the base pairs indicated in the annotation string a must be formed.
Shape index construction
1SS consshapeindex: fss(f) = {π(SS _cons)}
The simplest way to get a shape abstraction for each Rfam family is by translating the already given secondary structure consensus – that is the SS_cons row in the family alignment – with RNAshapes to one single shape. RNAshapes is not able to deal with pseudoknots, so these sparsely occurring structures must be resolved before by unpairing crossed over basepairs.
1consensusshapeindex: fss(f) = rankmin{⋂_{x∈f}RNAshapes(0, π, x)}
Evaluations revealed that the given SS_cons in Rfam often is not a really consensus structure in terms of a commonly shared shape. There are even a few families where not a single member folds in the shape π(SS_cons)! This can be caused by many reasons, e.g. inaccuracies in the thermodynamic model, too inhomogeneous families, or misbuilt families. Therefore, we construct consensusshapes directly from the family f. First we calculate all possible shapes for each family member. Then we scan for shapes common to all family members. If there are more than one common shapes, they are ranked by the sum of their individual ranks. The topranked common shape then represents the family in the index. This resembles the RNAcast approach to consensus structure prediction [14]. Using a trusted sequence alignment, this could also be done with RNAalifold [15] or RNAlishapes [16].
1hybridshapeindex
When a family is large, sequences are long, heterogeneous in structure, and shape abstraction level is low, the number of shapes to be computed to find a common shape may be impractical. In such a case, the 1consensusshapeindex construction resorts to 1SS_consshape indexing.
unionshapeindex: fss(f) = {π(RNAfold__{ C }(SS _cons, x))  x∈ f}
The first three approaches all use a single shape to represent a family. But often, a family is too diverged to be characterized well by a single shape. Instead, it can be described with one shape per sequence. We could simple use the shape of the MFE folding. However, to make use of the information in the family model, as captured by SS_cons, we use RNAfold for a constrained folding of each sequence, and compute the shapes from these folds. This implies that all shapes agree on the helices required by SS_cons, but may have additional helices in different places. We still expect fss(f) < f.
kbestshapeindex: fss(f) = ⋃_{x∈f}RNAshapes(k, π, x)
in the hash based shape index, where they later serve for the query lookups. The index is a hierarchically arranged, 3dimensional search structure – in descending search order of shapeabstractionlevel, shapestring and shrepenergy. The example also displays a very characteristic phenomenon: the weaker an abstraction is, the greater is the variance of shapes and the smaller are the familylists that are associated with these shapes. The last fact is not shown directly, but can be recognized by the increasing energyintervalsize.
Query shape spectrum construction
The five methods for index shape spectrum construction differ in the way they use family information, or effectively treat each family individually. For the query shape spectrum, we have only one sequence, and discuss only two variants.
1shapespectrum: qss(x) = RNAshapes(1, π, x)
This naive spectrum represents the query simply by its topranked shape, which by definition is the shape of its minimum free energy folding.
kshapespectrum: qss(x) = RNAshapes(k, π, x)
We tend not to trust solely in the minimum free energy folding of a sequence as its "true" structure. Accordingly, we should not use its shape alone. Choosing k > 1 shapes to represent the query results in a trade off between more shapes for a higher chance to find the right family, but potentially more fruitless calls to cmsearch.
Using multilevel abstraction
All the previous constructions of family and query shape spectra can be used in combination, under the restriction that both were made using the same shape abstraction function. But which abstraction level should be used? How does the level of abstraction influence indexbased search? On the lower abstraction levels (levels 1, 2), shapes are most specific. Relatively few families are associated with each shape. A shape match on this level often leads directly to the "right" family. However, if there is no match on the low abstraction level, the sequence may be a diverged family member, and a match on a higher abstraction level is still possible.
To take advantage of both, short runtime at low abstraction level and better chances to find diverged families with a strong abstraction, we construct fss(f) and qss(x) for level 1 through level 5, and the search iterates ascendingly through the five available abstraction levels. The hope is, that on average, a match can be found in one of the lower levels. Ascending through shape abstraction levels incurs practically no overhead – an unsuccessful CM search on (say) level 2 means that we need not research this family because of its hits to the index on a higher level.
Using folding energies
this normalized energy value. Several distinct peaks can be seen, showing that the shrep energy does not only depend on sequence length. Therefore, it provides an additional attribute of a family that can be used in filtration.
Omitting "difficult" families
Our evaluation shows that there are a few families which can hardly be found. Often (10 of 25 cases in Additional file 1), these are families with pseudoknots. Shape abstraction can, in principle, be extended to pseudoknots and implemented in pseudoknot folding programs such as pknotsRG [17, 18]. However, this has not been done yet. For the moment, it is not surprising that these families perform badly. In a large scale project, some more runtime can be saved by ignoring these families. Additional file 1 lists the most difficult families, sorted by their impact on the filter sensitivity.
Algorithm
The idea of shapebased indexing, as we have seen in the previous section, opens up a fourdimensional search space: We may vary the construction of family shape spectra, query shape spectra, shape abstraction level, and consider different tolerances in the use of energies. There are many tradeoffs. For example, when the family is represented by several shapes per family member, chances increase that describing the query with a single shape is sufficient.
We have explored numerous points in this methodical space, and for the routine application considered here – matching a large number of predicted ncRNAs as queries against Rfam – the choice of methods described in this section has worked best.
Method of choice
where E (q) denotes the shrep energy of shape q.
the corresponding level1shape index so the system goes to the next higher level. A lookup for the second and third level 2 query shape (blue coloured) results in the two candidate families RF00003 and RF00001. The exact but expensive cmsearch is now applied to the candidates but it turns out that they are not the right families. So the process continues with level 3. This time all three shapes match with the level 3 shape index and suggest six, two and six candidate families. Two of the six candidates from the upper shape have been checked already, so they can be removed from the candidate list to save runtime. Again, a cmsearch is performed for each of the remaining 13 candidates. After the candidate RF00432 (bold blue coloured) is identified as a true hit, the filter aborts the remaining operations and reports the identified hit to the user. The use of the shrep energies and the energyindices is not shown here, but one can imagine it as a second lookup placed between the first queryshape look up and the cmsearch in order to thin out the candidate list.
Results and Discussion
Leaving out the BLAST filtering, the existing search process for a query sequence uses all available covariance models from the Rfam database and compares them to the query via the cmsearch program. The Rfam is an increasing set of R covariance models, for Rfam 8.0 R is 574. So the runtime for one query would be O(R * n^{4}), in the worst case. Our approach reduces R to r_{ Sifter }, the number of models that have to be considered for a cmsearch comparison.
Construction of a testing set
Performance of a filtering system is governed by sensitivity and filtration ratio. To measure sensitivity, one needs a set of queries with known family membership. We build this set by dividing all Rfam full alignments into two disjoint sets. 60% sequences from each family are used for constructing the shape index. From the remaining 40% of the familysequences we choose at most four sequences for inclusion in the test set. This procedure yields 2030 test sequences in total, because some smaller families lack sufficient members. This set is named "family".
The second value for filtering performance is filtration ratio. In a whole genome screen, in most cases, a query will not result in a match to Rfam. The filtering suggests several possible families, which later have to be checked by cmsearch. Queries that result in a true match may lead to success after inspecting only a few suggested families, but queries with no family membership could only be rejected after checking each family that was recommended by the filter. This is why for evaluating efficiency, nonmatching samples in the testing set are important. They constitute the worst case in an application scenario.
The most obvious way of constructing nonmatching sequences is to generate them by random. Each nucleotide occurs with a relative frequency of 1/4. The length of the artificial sequences is uniformly distributed in the range from the shortest to the longest sequence in the Rfam database. 2000 random sequences build the "random_uniform" testing set.
In order to achieve a biological more realistic scenario, 11 proteincoding genes, considered unlikely to form any ncRNA, were selected from NCBI. Subsequently these genes were verified to truly not contain any ncRNA structures via the Rfam sequence search. These genes serve as source for further 2000 sequences, randomly cut out of the original genes. Again, lengths are uniformly distributed relative to Rfam. This set is called "genes_uniform". The chosen genes have the following accessionnumbers: gi110225369, gi85815826, gi109148525, gi45219732, gi79476965, gi20804396, gi40515, gi42568004, gi23297153, gi3005973 and gi46559395.
The energyindex depends on the folding energies and the folding energies are related to the sequence lengths. The previous sampling sets, namely "random_uniform" and "genes_uniform", are based on uniformly distributed sequence lengths in the Rfam database. But this is not true for the database. Rfams longest sequence has approximately 850 bases, but the overwhelming majority has around 50 to 150 nucleotides. The two further test sets "genes_nonuniform" and "random_nonuniform" reflect this imbalance, because their sequence lengths are distributed equally to Rfam. Each set is a composition of 2000 sequences. All test sets are provided on the RNAsifter web site, the whole evaluation was done with the union of all test sets (10030 sequences).
Testing
RNAsifter offers 11 parameters for choosing an acceptable tradeoff between sensitivity and filtration ratio:

Five parameters for the numbers of shapes for the query ${k}_{query}^{\pi}$. One for each shape abstraction function π_{1}, ⋯, π_{5}.

Five parameters for the numbers of shapes for shape index construction ${k}_{family}^{\pi}$. One for each shape abstraction function π_{1}, ⋯, π_{5}.

Percentage of tolerance between query and shape index energy ε.
We decided to freeze the number of shapes in each level to the same value and to sample with the following energy tolerances 1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1. So k_{ query }is the number of shapes for a query in all five shapelevels. k_{ family }is the same for shape index construction. This results in 5·5·7 = 175 different parameter settings. Each setting is used for an RNAsifter invocation with the app. 10.000 test sequences. The sensitivity (sen) is then calculated as the ratio between true positive RNAsifter outcomes and the overall number of positive test sequences, that is 2030: $sen=\frac{TP}{2030}$.
Filtration ratio (eff) is the ratio between all candidate families suggested by the RNAsifter and the number of all sequences multiplied with the number of families in Rfam: $eff=\frac{\#candidates}{\#sequences\times \#families}$. RNAalifold but with suboptimal structures of different shape), and the results from a complete run with the HMM filter (pink triangle in the lower right corner). 175 different parameter sets, see subsection Testing in the Evaluation, are used for the kbestshapeindex. The blue coloured data point (k_{ family }= 3, k_{ query }= 5, ε = 0.4) displays our recommended parameter settings as a practical tradeoff between filtration ratio and sensitivity.
k_{ family }for the kRNAlishapesshapeindex is handled as before, but the energyindex is switched off by a gigantic value for ε. The other five shapeindices use only one shape per family or per family member, so their k_{ family }is always set to 1. Their energyindex is switched off, too.
The results for all 175 parameter settings are depicted in Figure 1. The relatively wide parameter space allows to adjust RNAsifter for different applications. For a very accurate search the given parameters could be set to very high values, like our analysed maximum k_{ family }= 5, k_{ query }= 5, ε = 1.0 – this is the right uppermost data point in Figure 1. This unbalanced setup results in a good sensitivity of 98.67%, but it must be bought by a poor filtration ratio of only 27.58%. Vice versa, a fast runtime comes with a lower sensitivity, e.g. the data point k_{ family }= 1, k_{ query }= 5, ε = 0.1 indicates a good filtration ratio of 5.99% and a low sensitivity of 90.25%. Our recommendation (blue data point in Figure 1) for a good tradeoff between filtration ratio (15.03%) and sensitivity (97.44%) is the parameter setting k_{ family }= 3, k_{ query }= 5 and ε = 0.4.
Asymptotic runtimes. Asymptotic runtimes for the unfiltered cmsearch procedure and three filters, including our Sifter approach. R is the number of families in the current Rfam release. L is the target sequence length, N is the window size. f is the number of sequences in all families. r_{ p }is the remaining set of covariance models that have to be searched after the application of the filtering program p, r ≤ R.
Program  Asymptotic runtime 

CM  O(R·L·N^{3}) 
HMM  O(R·L·N^{2} + r_{ HMM }·L·N^{3}) 
BLAST  O(R·f·N + r_{ BLAST }·L·N^{3}) 
Sifter  O(N^{3} + r_{ Sifter }·L·N^{3}) 
"BLASTfilter": the Rfam Perl script "rfam_scan.pl" from the Rfam website is used. "RNAsifter" uses the kbestshapeindex with the suggested parameter set. The considered sequences for this comparison is the whole testset.
Conclusion
We have introduced the idea of shapeindex based filtering for faster search in structural RNA databases. The approach is based on the use of family shape spectra, query shape spectra, and shape abstraction levels, each of which can be computed in different ways. Central to all combinations of these constituents is that the shape index can be accessed with exact matching techniques, which requires negligible computing time. Different parameter settings have been explored systematically, with a filtration ratio ranging from 0.025 to 0.28, and sensitivity ranging from 0.81 to 0.99. We recommend a particular setting with filtration ratio 0.15 and sensitivity 0.974. Not all possibilities have been explored yet. For example, one could work with familyspecific parameters k_{ family }when constructing the index.
When (say) 15% of all family models must be searched, the practical speedup depends on which models these are. miRNAs, for example, can be searched with CM models much faster than (say) RNaseP. Therefore, we cannot simply derive a 40fold runtime speedup from a filtration ratio of 0.025. To provide a concrete example: The first 5003 ncRNA predictions from the RNAz screen [1] were matched against Rfam, and RNAsifter reduced runtime from 876.3 to 242.2 hours. Using the default parameters in this experiment, the filtration ratio is $\frac{1}{6}$, but runtime only decreases to $\frac{1}{4}$.
Two interesting observations can be drawn from the performance curve in Figure 1. Consider the relative position of the data points for (k_{ family }= 1, k_{ query }= 5, ε = 0.8) and (k_{ family }= 5, k_{ query }= 1, ε = 0.8), where the former is better both in terms of filtration ratio and of sensitivity. Hence, it is more important to consider multiple structures in the query than in the family. This has the plausible explanation that the several members in a family normally do not all fold into the same topranked shape, and this behaviour helps to find queries with similar behaviour. This means that structural variation within the family, to a certain extent, is positive information.
The other interesting observation suggested by the curve is that our method seems to hit the wall near 98% sensitivity. Assuming for a moment that shapeindex based filtering was a perfect method, a keen conclusion would be that 2% of Rfam sequences are misclassified. However, we know that our filtering cannot be perfect when structures are classified using pseudoknot features, which currently cannot be handled by shape abstraction. A closer look at the missed cases shows that these are mainly due to the members of the "difficult" families. They include the families RF00177, RF00373, RF00009 (with and without pseudoknots); a full list of 25 "difficult" families in the present Rfam release is given in Additional file 1.
In a preliminary investigation of human ncRNA predictions from the screen by Washietl et al. [1], applied to 35985 hypothetical ncRNAs of high RNAz score, our filtering technique discovered 4 new Rfam hits. On the other hand, it overlooked 43 hits found in the original study. This can be traced back to cases where the candidate ncRNA was poorly embedded in the RNAz window. Naturally, our structure based approach is more sensitive to this than a filter that looks for a short stretch of sequence similarity. Working with an adaptive window size is a current research problem in RNA gene prediction. RNAsifter will benefit from advances in this direction.
Comparing the different filtering techniques, the method of choice depends on the scenario. If (say) a complete genome is to be scanned, the HMMbased filtering [9, 10] provides a fast screening approach, where the faster HMMs must be run for each family, but only a few CM searches. Shapeindex based filtering does not provide a screening mode, and can only be applied when individual ncRNA candidates have been predicted by (computational or experimental) methods. The two filters can also be used in cooperation: The shape index can be used to further restrict the number of CM searches that have to be performed as the result of positive HMM filtering. Conversely, the HMM family model could be run prior to a CM search triggered by a shape index hit. Here, we have presented the shape indexing technique in its pure form. The tradeoffs achieved with filter combinations are a subject of future work.
second row family RF00017 is omitted, the third row omits families RF00017 and RF00230 and so on. Note that the testset shrinks, because the sequences of a skipped family are also removed from the set.
Declarations
Acknowledgements
SJ is grateful to PP Gardner and the Rfam group for indepth discussions of indexing strategies. SJ was partially funded as a research assistant by Bielefeld University.
Authors’ Affiliations
References
 Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nature Biotechnology 2005, 23(11):1383–90. 10.1038/nbt1144View ArticlePubMedGoogle Scholar
 Pedersen J, Bejerano G, Siepel A, Rosenbloom K, LindbladToh K, Lander E, Kent J, Miller W, Haussler D: Identification and Classification of Conserved RNA Secondary Structures in the Human Genome. PLoS Computational Biology 2006., 2(4):Google Scholar
 Torarinsson E, Sawera M, Havgaard JH, Fredholm M, Gorodkin J: Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome research 2006, 16(7):885–889. 10.1101/gr.5226606PubMed CentralView ArticlePubMedGoogle Scholar
 Missal K, Zhu X, Rose D, Deng W, Skogerbø G, Chen R, Stadler PF: Prediction of structured noncoding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 2006, 306B(4):1552–5007. 10.1002/jez.b.21086View ArticleGoogle Scholar
 GriffithsJones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating noncoding RNAs in complete genomes. Nucleic Acids Res 2005, 33(Database issue):D121–4. 10.1093/nar/gki081PubMed CentralView ArticlePubMedGoogle Scholar
 Eddy SR, Durbin R: RNA sequence analysis using covariance models. Nucleic acids research 1994, 22(11):2079–2088. 10.1093/nar/22.11.2079PubMed CentralView ArticlePubMedGoogle Scholar
 Nawrocki EP, Eddy SR: QueryDependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Computational Biology 2007., 3(3):Google Scholar
 GriffithsJones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database. Nucleic acids research 2003, 31: 439–441. 10.1093/nar/gkg006PubMed CentralView ArticlePubMedGoogle Scholar
 Weinberg Z, Ruzzo WL: Exploiting conserved structure for faster annotation of noncoding RNAs without loss of accuracy. Bioinformatics 2004, 20(suppl 1):334–341. 10.1093/bioinformatics/bth925View ArticleGoogle Scholar
 Weinberg Z, Ruzzo WL: Sequencebased heuristics for faster annotation of noncoding RNA families. Bioinformatics 2006, 22: 35–39. 10.1093/bioinformatics/bti743View ArticlePubMedGoogle Scholar
 Giegerich R, Voss B, Rehmsmeier M: Abstract Shapes of RNA. Nucleic acids research 2004, 32(16):4843–4851. 10.1093/nar/gkh779PubMed CentralView ArticlePubMedGoogle Scholar
 Voß B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biology 2006., 4(5):Google Scholar
 Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatshefte für Chemie 1994, 125: 167–188. 10.1007/BF00818163View ArticleGoogle Scholar
 Reeder J, Giegerich R: Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics 2005, 21(17):3516–3523. 10.1093/bioinformatics/bti577View ArticlePubMedGoogle Scholar
 Hofacker IL, Fekete M, Stadler PF: Secondary Structure Prediction for Aligned RNA Sequences. Journal of molecular biology 2002, 319(5):1059–1066. 10.1016/S00222836(02)00308XView ArticlePubMedGoogle Scholar
 Voß B: Structural analysis of aligned RNAs. Nucleic acids research 2006, 34(19):5471–5481. 10.1093/nar/gkl692PubMed CentralView ArticlePubMedGoogle Scholar
 Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics 2004., 5(104):Google Scholar
 Reeder J, Steffen P, Giegerich R: pknotsRG: RNA pseudoknot folding including nearoptimal structures and sliding windows. Nucleic acids research 2007, 35(suppl 2):W320–324. 10.1093/nar/gkm258PubMed CentralView ArticlePubMedGoogle Scholar
 RNAsifter[http://bibiserv.techfak.unibielefeld.de/rnasifter/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.