Non-coding RNA detection methods combined to improve usability, reproducibility and precision
© Raasch et al; licensee BioMed Central Ltd. 2010
Received: 10 May 2010
Accepted: 29 September 2010
Published: 29 September 2010
Non-coding RNAs gain more attention as their diverse roles in many cellular processes are discovered. At the same time, the need for efficient computational prediction of ncRNAs increases with the pace of sequencing technology. Existing tools are based on various approaches and techniques, but none of them provides a reliable ncRNA detector yet. Consequently, a natural approach is to combine existing tools. Due to a lack of standard input and output formats, combining and comparing existing tools is difficult. Also, for genomic scans they often need to be incorporated into detection workflows using custom scripts, which decreases transparency and reproducibility.
We developed a Java-based framework to integrate existing tools and methods for ncRNA detection. This framework enables users to construct transparent detection workflows and to combine and compare different methods efficiently. We demonstrate the effectiveness of combining detection methods in case studies with the small genomes of Escherichia coli, Listeria monocytogenes and Streptococcus pyogenes. With the combined method, we gained 10% to 20% precision for sensitivities from 30% to 80%. Further, we investigated Streptococcus pyogenes for novel ncRNAs. Using multiple methods--integrated by our framework--we determined four highly probable candidates. We verified all four candidates experimentally using RT-PCR.
We have created an extensible framework for practical, transparent and reproducible combination and comparison of ncRNA detection methods. We have proven the effectiveness of this approach in tests and by guiding experiments to find new ncRNAs. The software is freely available under the GNU General Public License (GPL), version 3 at http://www.sbi.uni-rostock.de/moses along with source code, screen shots, examples and tutorial material.
Non-coding RNAs have drawn much attention in the last couple of years, after being neglected for a long time. They are now known to play key roles in diverse cellular processes such as regulation of gene expression, splicing and directing chemical modifications [2, 3]. Functional categorization of RNAs is not yet complete as new functions are discovered continuously [4, 5].
Detection of non-coding RNA genes in genomic sequences is an urgent but unsolved problem in bioinformatics. The accelerated pace of sequencing technology further increases the need for reliable identification of ncRNAs. The main approaches to computational prediction of ncRNAs are compositional analysis, secondary structure prediction, structural or sequence-based homology and the use of promoter and terminator signals. Numerous tools following one of these approaches or combinations thereof exist [6, 8].
Compositional analysis can be a simple scan for local GC-content, an approach successful in AT-rich hyperthermophiles. Considering more compositional features in a machine learning approach has also shown success. Because functional RNAs rely on a defined secondary structure, prediction of transcript minimum free energy is used as a means for detecting ncRNA genes. Freyhult et al. examined different quantities that can be used for this approach. Sequence-based homology can be used for detection if reference genomes with appropriate evolutionary distances are available.
Successful tools such as QRNA and RNAz combine secondary structure prediction with a homology approach relying on multiple alignments. The most comprehensive RNA family database, RFAM, uses covariance models combining structural and sequence conservation to establish RNA families. The covariance model can be used to find new members of existing families, although at considerable computational expense. Dynalign uses an approximation of Sankoff's algorithm for structural alignment of two RNAs.
Xiao et al. used promoter and terminator prediction in intergenic regions aided by conservation and secondary structure analysis to predict ncRNAs.
To achieve better accuracy, some tools limit the scope to specific ncRNA families such as tRNA, miRNA and snoRNA.
However, none of the available tools for general ncRNA detection has reached a level of reliability comparable to protein-gene detection software. In contrast to ncRNA genes, protein genes exhibit codon bias, open reading frames and strong sequence conservation, simplifying their detection. Since the diverse methods for ncRNA detection are complementary, a practical approach is to combine the available methods, as suggested by recent reviews [6, 8, 19, 20]. Meyer et al. also remarked that many ncRNA detection methods rest on the assumption of a significant secondary structure, which may not always be necessary for an ncRNA to function. Consequently, even the more successful methods, which rely on this assumption, need to be complemented with others to achieve more comprehensive predictions.
The combination of methods allows for precise predictions by using candidates that are predicted by several methods, or finding more candidates by using predictions from all methods. If the combination is done under a well designed framework, reproducibility, transparency and comparison of predictions are improved as well.
Previous efforts for the integration of data and algorithms in genomic research exist: RNAStructure integrated secondary structure prediction and structure-based homology analysis but is not easily extended and not readily usable for genomic scans. Tools such as sRNAfinder combine several approaches to improve prediction results, but in a predefined way. The UCSC genome browser offers a huge amount of experimental data, pre-calculated predictions and analyses for a selected number of genomes. Basic functions for comparative genomics are available, extended by an interface to Galaxy. Galaxy is a project that also aims to overcome custom and redundant scripting for bioinformatics tasks in genomic research, but does not yet offer specialized tools for ncRNA prediction. TAVERNA is a powerful all-purpose framework, but its primary source of functionality, "BioCatalogue", does not yet contain essential ncRNA-related tools such as RNAz and Dynalign. LeARN is an extensible framework for annotating newly sequenced genomes, but it focuses on processing trusted results from detection tools rather than on improving predictions by combining analyses from different algorithms. Consequently, there is a need for a framework that is easy to use and specialized for non-coding RNA detection. The main goals of our project are:
Combination: Improving ncRNA detection by combining existing methods.
Comparison: Easy comparison of the prediction performance of different methods must be possible.
Reproducibility: Application, combination and comparison of methods must be performed in a reproducible and transparent way.
Usability: User experience should be improved by a GUI and visualization of all workflow steps and their respective results. No programming should be required to construct workflows, and to combine and compare methods.
Our software is aimed at three user groups: First, for bioinformaticians, the use and the combination of integrated tools must be simple. Second, developers of new algorithms for ncRNA detection must be provided with a ready-to-use environment and test bed. This removes the need to re-program solutions for tasks such as parsing files or visualization. Third, biologists must be able to re-use tested methods easily.
The implementation presented here supports compositional analysis, sequence-based homology (BLAST), sequence and structural homology (RNAz and Dynalign) and secondary structure prediction (using RNAfold). Our tool can easily be extended through an open architecture.
In the next section we show how moses was designed to fulfill these goals. In case studies we then demonstrate the effectiveness of combining methods: depending on how the methods are combined, either precision or sensitivity is increased. Furthermore, our framework has been successfully applied to guide experiments in Streptococcus pyogenes to find new ncRNAs.
An advantage of our modular approach is that it provides a good trade-off between flexibility and complexity: The user constructs workflows simply by chaining modules together and providing the parameters needed for their calculations. The output of every module can serve as the input for every other module. This allows for a free combination of modules while not requiring any programming skills.
The modules used to construct workflows can contain external tools, directly implemented analysis methods or helper functions. Each module represents one step in an analysis workflow. In the case of external tools, the module converts the input data, runs the tool, parses the output and converts it back into the moses format to ensure compatibility between all modules. Converting to one common data exchange format is more efficient than converting input and output between different tools, even though this is common practice in bioinformatics using custom scripts. The format we chose is a matrix of float values. Columns in the matrix correspond to nucleotide positions. Rows can hold different kinds of information, for instance ncRNA probability scores from several detection methods.
This basic format is very simple and yet can hold all types of information needed for the purpose of ncRNA prediction. The modules can be written using a data structure that is familiar to most programmers.
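As a rough illustration of the exchange format described above, the following Python sketch builds a matrix of float values with one column per nucleotide position and one row per information source. The function names, the row assignment and the toy scores are hypothetical and are not taken from the moses source code; storing a window score at the window's centre column follows the convention of the sliding-window modules.

```python
from typing import List

Matrix = List[List[float]]


def new_matrix(n_rows: int, n_positions: int, fill: float = 0.0) -> Matrix:
    """Create a rows x positions matrix of floats (one column per nucleotide)."""
    return [[fill] * n_positions for _ in range(n_rows)]


def store_at_centre(matrix: Matrix, row: int, window_start: int,
                    window_size: int, score: float) -> int:
    """Store a window score at the window's centre column; return that column."""
    centre = window_start + window_size // 2
    matrix[row][centre] = float(score)
    return centre


# Hypothetical row assignment: one row per detection method.
ROW_RNAZ, ROW_RNAFOLD = 0, 1

profile = new_matrix(n_rows=2, n_positions=10)
store_at_centre(profile, ROW_RNAZ, window_start=0, window_size=5, score=0.97)
store_at_centre(profile, ROW_RNAFOLD, window_start=3, window_size=5, score=4.8)
```

Because every row is aligned to the same nucleotide coordinates, results from different methods can be compared or combined column by column without any format conversion.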
The parameters needed to run a module are saved in human readable format in the corresponding moses-file along with IDs identifying the source modules. This creates a structure of dependent calculations that form a detection workflow. Individual modules of this workflow can be exchanged to modify and re-use the workflow. For example, the modules holding the analysed genome can be exchanged to perform an identical scan on a different species.
Main detection methods
The key modules moses provides are BLAST, word frequency analysis (typically used for GC-content analysis), RNAfold, RNAz (using ClustalW for alignments), Dynalign and calculation of DNA properties, such as base stacking energy or bendability. BLAST can be used to compare two genomes, scan a genome for occurrences of a query sequence or locate conserved regions of a genome by BLASTing against a local database created with the BLAST helper tool formatDB. The RNAfold module uses a sliding window approach. For each sliding window the minimum free energy structure is predicted, and the corresponding minimum free energy value is stored at the centre position of the respective window. This results in a numerical profile aligned with the genome's base pairs.
The RNAz module scans a genome for ncRNA, requiring the output of several BLAST modules. Again, a sliding window approach is used. For each window, the most similar regions in reference genomes, as detected by BLAST, are used to construct a multiple alignment using ClustalW. This alignment is then analysed for ncRNA by RNAz. Finally the RNAz-score (called "RNA class probability") is stored at the window's centre position.
Similar to the RNAz module, the Dynalign module uses a sliding window and relies on BLAST modules to find the most similar regions in reference genomes for each window. A structural alignment of the analysed window and the region with the best BLAST-score is calculated using Dynalign. The output for each window is the alignment score.
The DNA properties module is similar to the RNAfold module: it calculates a numerical profile corresponding to a certain physical property of a DNA subsequence. The work of Abeel et al. shows that profiles of thermodynamic DNA properties can be used to detect transcriptional signals. Those signals not pointing to known protein genes may be indications of ncRNA genes. DNA base stacking energy, bendability or protein-induced deformability are examples of such properties. To calculate a numerical profile we use the procedure given by Abeel et al.: Each (overlapping) dinucleotide of a DNA sequence is converted into a number according to a conversion table. These values are then smoothed using a sliding window; for each window the average is calculated and stored at the centre position of the window. Parameters for the properties were taken from EP3, a promoter detection tool developed by Abeel et al. The full list of available properties can be viewed on the tool's website (http://bioinformatics.psb.ugent.be/webtools/ep3/?conversion).
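The profile calculation described above can be sketched in a few lines of Python. This is only a minimal illustration: the conversion table below is a placeholder that counts G/C bases per dinucleotide and does not contain the EP3 parameters, and all function names are hypothetical.

```python
from typing import Dict, List, Optional


def toy_conversion_table() -> Dict[str, float]:
    """Placeholder table: number of G/C bases per dinucleotide (NOT EP3 values)."""
    return {a + b: float((a in "GC") + (b in "GC"))
            for a in "ACGT" for b in "ACGT"}


def dinucleotide_values(sequence: str, table: Dict[str, float]) -> List[float]:
    """Convert each overlapping dinucleotide into a number via the table."""
    seq = sequence.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth(values: List[float], window_size: int) -> List[Optional[float]]:
    """Average each sliding window and store the mean at the window centre.

    Positions that are not the centre of a complete window stay None.
    """
    profile: List[Optional[float]] = [None] * len(values)
    for start in range(len(values) - window_size + 1):
        window = values[start:start + window_size]
        profile[start + window_size // 2] = sum(window) / window_size
    return profile


values = dinucleotide_values("AAGGCC", toy_conversion_table())
profile = smooth(values, window_size=3)
```

Swapping in a real property only requires replacing the table; the smoothing step is independent of which property is profiled.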
Besides the key modules, predictions from RFAM can be incorporated by BLASTing against an RFAM dump. To include terminator predictions, output from TransTermHP can be loaded as well.
All available methods can also be applied to sub-regions of a sequence. This is useful, for instance, to exclude protein gene regions from analysis or limit the calculations to a region of interest. Furthermore, moses provides a number of arithmetic and logical functions to process the result of any method.
These functions are also applied for combining the detection methods to arrive at more precise or more sensitive predictions. Usually, this involves trading precision for sensitivity or vice versa. A more precise combined prediction is achieved by considering only predictions made by more than one individual method. A more sensitive combined prediction is achieved by collecting the predictions from all individual methods. A more sophisticated combination is to use weighted scores of individual methods: Reliable methods can be weighted higher than relatively unreliable ones to get a combined result that is more precise yet retains much of the sensitivity of the individual methods.
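The three combination strategies named above can be illustrated with a short Python sketch operating on per-position 0/1 predictions. The function names, the example weights and the threshold are assumptions chosen for illustration; they are not the arithmetic and logical functions as implemented in moses.

```python
from typing import List, Sequence


def vote_count(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Number of methods predicting ncRNA at each position."""
    return [sum(column) for column in zip(*predictions)]


def combine_precise(predictions: Sequence[Sequence[int]],
                    min_votes: int = 2) -> List[int]:
    """Keep only positions predicted by at least min_votes methods."""
    return [1 if v >= min_votes else 0 for v in vote_count(predictions)]


def combine_sensitive(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Collect the predictions of all methods (logical OR)."""
    return [1 if v >= 1 else 0 for v in vote_count(predictions)]


def combine_weighted(predictions: Sequence[Sequence[int]],
                     weights: Sequence[float],
                     threshold: float) -> List[int]:
    """Weight each method and keep positions whose weighted sum exceeds threshold."""
    combined = [sum(w * p for w, p in zip(weights, column))
                for column in zip(*predictions)]
    return [1 if c > threshold else 0 for c in combined]


# Toy predictions of three methods over four positions.
preds = [[1, 1, 0, 0],
         [1, 0, 0, 1],
         [0, 1, 0, 0]]
precise = combine_precise(preds)
sensitive = combine_sensitive(preds)
weighted = combine_weighted(preds, weights=[0.6, 0.3, 0.1], threshold=0.5)
```

In this toy example the last position is supported only by the second method, so it survives in the sensitive combination but is dropped by both the voting and the weighted variant.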
The quality of any single or combined detection method can be analysed and compared using the built-in statistical evaluation, if data of known ncRNAs is available.
Graphical User Interface and Visualization
To make our software accessible to a wide range of users and to enhance usability, we provide a graphical user interface. Included features of the interface are:
easy access to external tools as moses modules,
constructing workflows with visualization of the modules' dependencies,
multiple visualization modes for numerical profiles, supporting visual inspection, comparison and detection of correlations,
browsing of genome annotations or calculated prediction signals,
statistical assessment of each method, e.g., precision, sensitivity.
Integrated visualization of all intermediate results of a workflow helps in finding mistakes, generating hypotheses and interpreting results in the context of all available information.
Results & Discussion
Construction of the test regions for the case studies
Escherichia coli str. K-12 substr. MG1655: 151 consisting of 11900 bp (5.3%); 32754 bp (14.7%); 180505 bp (81.0%)
Listeria monocytogenes EGD-e: 101 consisting of 19440 bp (6.8%); 37317 bp (13.1%); 231107 bp (81.1%)
Streptococcus pyogenes MGAS5005: 73 consisting of 14625 bp (6.9%); 37067 bp (17.5%); 163416 bp (77.0%)
Influence of the window size in Escherichia coli
Escherichia coli str. K-12 substr. MG1655: Enterobacter sp. 638; Klebsiella pneumoniae 342; Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
Listeria monocytogenes EGD-e: Listeria innocua Clip11262; Listeria welshimeri serovar 6b str. SLCC5334; Listeria seeligeri serovar 1/2b str. SLCC3954
Streptococcus pyogenes MGAS5005: Streptococcus agalactiae 2603VR; Streptococcus equi subsp. zooepidemicus str. MGCS10565; Streptococcus pyogenes M1 GAS; Streptococcus pyogenes MGAS315
Genome for new predictions: Streptococcus pyogenes NZ131
Genomes for BLAST sequence conservation analysis: see moses website
The RNAz module scans a genome for ncRNA using a sliding window. For each window, the most similar regions in the reference genomes, as detected by the BLAST modules, are aligned together with the analysed window using ClustalW. The resulting multiple alignment is then analysed by RNAz to give a so-called "RNA class probability".
In the Dynalign method, only the reference region with the highest BLAST score is used for structural alignment using Dynalign.
The RNAfold method consists of two steps. First, the minimum free energy value of the energetically optimal fold is calculated for each window. Second, the distribution of minimum free energy values for sequences of the nucleotide composition and length given by the analysed window is sampled. To this end, RNAfold calculates the minimum free energy value for shuffled versions of the original window. The shuffling method by Altschul et al. is used to preserve not only the mono- but also the dinucleotide composition, because the secondary structure prediction is especially sensitive to the dinucleotide composition. For our tests we used 100 shuffled versions. Mean and standard deviation are obtained from the sampled distribution to estimate the significance of the actual minimum free energy value. The final output of the RNAfold method is the Z-score for each window, that is, the difference between the value of the original window and the mean of the shuffled windows, expressed in standard deviations. The RNAfold module inverts the sign of the minimum free energy values for convenience. The RNAfold method and the RNAz method are closely related, but RNAz does not use the Z-score of the original sequence; instead it uses the average of the Z-scores of all sequences in the alignment. Our results show that the pure Z-score as used by Kavanaugh et al. is useful for ncRNA prediction; however, the way RNAz approximates Z-scores is orders of magnitude faster and practically of the same accuracy.
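The Z-score step can be sketched as follows, assuming the minimum free energy values of the original window and of its shuffled versions have already been computed by an external folding tool. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator is used. The sign is inverted as described above, so windows that fold more stably than their shuffled background receive positive scores.

```python
import statistics
from typing import Sequence


def inverted_z_score(mfe_original: float, mfe_shuffled: Sequence[float]) -> float:
    """Z-score of a window's MFE against its shuffled background.

    MFE values are negative for stable folds; the sign is inverted so that
    a window more stable than the background yields a positive Z-score.
    """
    mean = statistics.mean(mfe_shuffled)
    sd = statistics.stdev(mfe_shuffled)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("shuffled MFE values have zero variance")
    return (mean - mfe_original) / sd


# Toy numbers (kcal/mol); a real run would use e.g. 100 shuffled windows.
z = inverted_z_score(-20.0, [-10.0, -12.0, -8.0, -10.0])
```

Note that the shuffled sequences themselves must preserve the dinucleotide composition; a plain random permutation of the window would not satisfy this requirement.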
For the integrated tools BLAST, ClustalW and RNAz, default parameters are used. Graphical output of predicted structures is suppressed for RNAfold to save computation time. TransTermHP predictions for Listeria monocytogenes and Streptococcus pyogenes were downloaded from the TransTermHP website (http://transterm.cbcb.umd.edu/); predictions for Escherichia coli were calculated with the downloaded program using default parameters.
The methods were combined by applying threshold filters to the results of the RNAz, RNAfold and Dynalign methods. The thresholds were 0.995, 4.5 and 550, respectively. For TransTermHP the confidence score threshold was 70, the default value used for the pre-calculated predictions from the TransTermHP website. Based on the parameter scans that were performed, we selected values that gave intermediate precision and sensitivity for the individual methods.
After the thresholds have been applied, a "0" is stored at the centre of each window if the value was below or equal to the threshold, and a "1" if it was above. The values for those three methods were added; additionally, a "1" was added for each base pair of a predicted terminator.
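A minimal Python sketch of this thresholding and summation step is given below. The thresholds are the ones stated in the text; the dictionary keys, function names and toy profiles are hypothetical, and the sketch assumes that all profiles and the terminator mask are already aligned to the same positions.

```python
from typing import Dict, List, Sequence

# Thresholds as stated in the text; the key names are illustrative only.
THRESHOLDS: Dict[str, float] = {"rnaz": 0.995, "rnafold": 4.5, "dynalign": 550.0}


def binarise(profile: Sequence[float], threshold: float) -> List[int]:
    """Store 1 where the score is above the threshold, 0 where below or equal."""
    return [1 if value > threshold else 0 for value in profile]


def combined_votes(profiles: Dict[str, Sequence[float]],
                   terminator_mask: Sequence[int]) -> List[int]:
    """Add the binarised method profiles plus 1 per predicted terminator base."""
    votes = list(terminator_mask)
    for name, threshold in THRESHOLDS.items():
        for i, flag in enumerate(binarise(profiles[name], threshold)):
            votes[i] += flag
    return votes


# Toy profiles over three positions.
profiles = {
    "rnaz": [0.999, 0.5, 0.995],
    "rnafold": [5.0, 4.5, 1.0],
    "dynalign": [600.0, 100.0, 551.0],
}
votes = combined_votes(profiles, terminator_mask=[0, 1, 1])
```

The resulting vote count per position ranges from 0 to 4 and can itself be thresholded to trade sensitivity against precision.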
Sensitivity and precision were calculated using the usual definitions: Sensitivity is the ratio of true positive windows to all known ncRNA-containing windows. Precision is the ratio of true positive windows to the sum of true positive and false positive windows.
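For concreteness, the two window-based definitions can be written as a short Python sketch. The function name and the 0/1 list representation of windows are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def window_metrics(predicted: Sequence[int],
                   known: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, precision) for per-window 0/1 labels.

    predicted[i] is 1 if window i is predicted to contain ncRNA,
    known[i] is 1 if window i contains a known ncRNA.
    """
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    tp = sum(1 for p, k in zip(predicted, known) if p and k)
    fp = sum(1 for p, k in zip(predicted, known) if p and not k)
    fn = sum(1 for p, k in zip(predicted, known) if not p and k)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


sensitivity, precision = window_metrics(
    predicted=[1, 1, 0, 0, 1, 0, 1],
    known=[1, 0, 1, 0, 1, 0, 0],
)
```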
While window-based precision and sensitivity are good to compare different methods, they do not reflect the practical value of predictions that are to be used to guide experimental verification of candidates. In practice, several windows next to each other that are predicted to contain ncRNA will be seen as one predicted locus or signal (for our purposes we want to neglect gaps). Those signals will then be used to guide experiments instead of each individual window.
Therefore, we define signal precision as the ratio of signals that overlap known ncRNA to all signals as an analogue to precision, and we define signal sensitivity as the ratio of known ncRNA that overlap a signal to all known ncRNA as an analogue to sensitivity. The signal-based figures can be misleading if used alone, as overly long signals will yield high signal precision and signal sensitivity without being specific enough for experiments. In order to check the quality of the predictions, we also calculated the false positive rate, defined as the ratio of false positive windows to all windows known to not contain ncRNA. Figure 3 shows the prediction quality of the four individual methods and the combined method for all species in terms of window-based and signal-based figures as well as the false positive rates.
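The signal-based figures and the false positive rate can be sketched in Python as follows. As a simplifying assumption, a signal is taken here to be a run of directly adjacent positive windows, and known ncRNAs are given as inclusive (start, end) intervals; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def signals_from_windows(predicted: Sequence[int]) -> List[Interval]:
    """Merge runs of adjacent positive windows into signals."""
    signals: List[Interval] = []
    start = None
    for i, p in enumerate(predicted):
        if p and start is None:
            start = i
        elif not p and start is not None:
            signals.append((start, i - 1))
            start = None
    if start is not None:
        signals.append((start, len(predicted) - 1))
    return signals


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def signal_metrics(signals: Sequence[Interval],
                   known: Sequence[Interval]) -> Tuple[float, float]:
    """Return (signal precision, signal sensitivity)."""
    hit_signals = sum(1 for s in signals if any(overlaps(s, k) for k in known))
    hit_known = sum(1 for k in known if any(overlaps(s, k) for s in signals))
    precision = hit_signals / len(signals) if signals else 0.0
    sensitivity = hit_known / len(known) if known else 0.0
    return precision, sensitivity


def false_positive_rate(predicted: Sequence[int], known_mask: Sequence[int]) -> float:
    """False positive windows divided by all windows without known ncRNA."""
    negatives = sum(1 for k in known_mask if not k)
    fp = sum(1 for p, k in zip(predicted, known_mask) if p and not k)
    return fp / negatives if negatives else 0.0


predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
signals = signals_from_windows(predicted)
sig_precision, sig_sensitivity = signal_metrics(signals, known=[(2, 3), (4, 4)])
fpr = false_positive_rate(predicted, known_mask=[0, 0, 1, 1, 1, 0, 0, 0, 0, 0])
```

The toy example also shows why the false positive rate is a useful check: one of three signals hits a known ncRNA, yet five of the seven windows without known ncRNA are flagged.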
The plots reveal that for a wide sensitivity range the combined method largely improves the "signal-precision" by 15% and the window-based precision by about 10% across the three tested species. The improvements are confirmed by the reduced false positive rate visible in Figure 3, subfigures 3a-c.
Our tests show that our software allows for a flexible and easy combination of ncRNA detection methods and that the combination improves detection results. Methods can easily be compared using the available statistics.
Prediction of novel ncRNAs
To show that moses can successfully guide experimental detection of ncRNA we present predictions for Streptococcus pyogenes NZ131, a human pathogen. We used four methods in a genomic scan to minimize false positives. As we have seen in the case studies, which were based on automated workflows, even the improved methods suffer from relatively high false positive rates. To arrive at a candidate list that had the most potential to be true ncRNA genes--in order to minimize unsuccessful experiments--we used manual inspection of multiple data sources instead.
All data calculated and the used parameters are available on the moses website.
The data sources were RNAfold secondary structure predictions, calculated DNA base stacking energy, BLAST-calculated conservation against related genomes and RNAz-predictions. The RNAfold module was used with window size 41, the DNA properties module with window size 81, step size 1 base pair in both cases. For RNAz the window size was 41 with step size 5 base pairs. The calculations were performed on the full genome sequence.
We examined the characteristic RNAfold and DNA base stacking energy profiles around known ncRNA genes to manually distinguish them from genomic background.
Also, isolated conserved spots were considered as clues for potential genes. Conservation was determined by BLASTing the NZ131 genome against all pyogenes serotypes in one module and against a selection of Streptococcus genomes in a second. Intergenic regions were examined for these four clues. Data used for the visual inspection is available on the moses website.
Criteria for visual inspection of intergenic regions in Streptococcus pyogenes NZ131
1) BLAST vs Streptococci
2) BLAST vs S. pyogenes
3) DNA base stacking energy
The combination of multiple methods, possible in moses, has yielded highly probable candidates. RNAz or BLAST alone, for instance, would have given us hundreds of candidate loci to examine (data on the moses website).
Our predictions demonstrate that the integrated approach possible with moses is able to guide experimental detection of new ncRNAs. The RT-PCR experiments are not sufficient to rule out that the observations are related to neighbouring transcripts rather than true ncRNA. Accordingly, further experiments are in progress to confirm and characterize the four new candidate ncRNAs and their function in Streptococcus pyogenes physiology and virulence.
The computationally more demanding algorithms, RNAz analysis and RNAfold secondary structure prediction with shuffled comparison sequences, need approximately 120 hours for full prokaryotic genomes (assuming an average size of 4 Mb) on standard workstation computers. Dynalign takes even longer because it performs full structural alignments.
Our tests were performed on a machine with an Intel(R) Core(TM) 2 Duo CPU at 2.66 GHz and 2 GB of RAM, with the use of parallelization. The RNAfold method and the RNAz method each took three hours for 99,950 analysed windows (for a step size of one base pair, the number of base pairs minus the window size plus one equals the number of windows to analyse). The window size for both methods was 51. The window step was one base pair to obtain maximum resolution. However, this is not to imply that we can predict the exact gene starts and ends. For the RNAfold method 100 shuffled windows were used, and three reference genomes were used for the RNAz method. If the analysis is limited to the intergenic regions, the time is reduced depending on the percentage of coding regions of the genomes under examination (often, intergenic regions constitute 10% of prokaryotic genomes). Other ways to avoid overly long calculations include choosing a larger step size and running calculations in parallel by dividing the analysed genomes into smaller parts.
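The window count and the runtime figures quoted above can be cross-checked with a small Python sketch. It assumes that runtime scales linearly with the number of windows, which is a simplification; the function names and default arguments are illustrative only.

```python
def number_of_windows(sequence_length: int, window_size: int, step: int = 1) -> int:
    """Number of complete sliding windows over a sequence."""
    if sequence_length < window_size:
        return 0
    return (sequence_length - window_size) // step + 1


def estimated_hours(sequence_length: int, window_size: int, step: int = 1,
                    reference_windows: int = 99_950,
                    reference_hours: float = 3.0) -> float:
    """Extrapolate runtime linearly from the reported benchmark (assumption)."""
    windows = number_of_windows(sequence_length, window_size, step)
    return windows / reference_windows * reference_hours


# 100,000 bp with window size 51 gives the 99,950 windows quoted in the text.
benchmark_windows = number_of_windows(100_000, 51)
# A 4 Mb genome at step size 1 extrapolates to roughly 120 hours.
full_genome_hours = estimated_hours(4_000_000, 51)
# A step size of 5 reduces the number of windows about fivefold.
coarse_windows = number_of_windows(100_000, 51, step=5)
```

Under this linear assumption, restricting the scan to intergenic regions or increasing the step size reduces the runtime in direct proportion to the number of windows saved.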
The methods used here are in principle not restricted to prokaryotes; the limiting factor is the sheer size of the genome.
We developed a framework for reproducible, transparent and easy combination of existing ncRNA detection methods. Our contribution helps to satisfy the need for a combined approach as suggested by recent reviews [6, 8, 19, 20]. The main improvements our framework provides are:
Improved ncRNA detection by combining existing methods
Wrapping existing tools and methods in moses modules that convert input and output formats to a common data interchange format makes combination possible. We have demonstrated the effectiveness of combining methods in tests on Escherichia coli, Listeria monocytogenes and Streptococcus pyogenes. Further, we predicted novel ncRNAs in Streptococcus pyogenes using multiple methods to yield highly probable candidates, thereby reducing unsuccessful experiments. Final confirmation and subsequent characterisation of the candidates is in progress.
Facilitated comparison of methods
The methods used can readily be compared using the integrated accuracy report. Statistical figures such as signal-based and by-base-pair precision and sensitivity are readily at hand. This allows effective evaluation of existing methods and the selection of appropriate methods, e.g., according to available reference genomes or a given taxon.
Improved reproducibility, re-usability and transparency
Workflows are self-documented as all parameters and data dependencies are stored in the moses files. This means the workflows are transparent because no hidden conversions and no implied functions are performed, only the ones defined by the user. No custom scripts or custom in-house software is involved in studies carried out in moses. Furthermore, an existing workflow can be reused on different sequences, different data or altered parameters.
We created a GUI and visualization for all intermediate steps of a workflow. This enables users to detect flaws in a workflow and helps to interpret the results. The integrated environment supports hypothesis generation and brings data and results in context with all available information.
The method of constructing workflows in moses is easy as it requires no programming and no scripting. This makes it an attractive tool for bioinformaticians. Extending the framework with new algorithms is made easy through an open architecture with a plug-in mechanism. Programming effort is thus minimized and developers of new algorithms are provided with a ready-to-use platform. Biologists can easily reuse existing workflows.
The next step in the development of our framework is the integration of further existing methods and algorithms. Combination of methods could be enhanced by including support for SVM training and classification. Possibly, in the course of adding more tools the scope could be expanded to not only find ncRNA genes but protein genes, promoters, terminators and transcription factor binding sites as well. The result would be a complete picture of a genome under one common framework.
A recent approach to detect regulatory regions is pattern recognition in profiles of physical properties of the DNA. As our framework offers different sources for such profiles, not only based on physical properties, it is a natural extension of our work to apply pattern detection to the profiles calculated by moses.
The work of BK and NP was supported by a BMBF grant in the framework of the ERA-Net PathoGenoMics 2 program (FKZ 0315437B). The authors would like to thank Jana Normann for expert technical assistance and Alexander Raasch for helping with the Java implementation and preparation of figures. Ulf Liebal and Sarah Zaatreh helped with proofreading the manuscript.
The work of OW, JV, US and PR is funded by the University of Rostock. OW and US were also supported by the Deutsche Forschungsgemeinschaft (DFG), project WO 991/4-1 and by the German Federal Ministry for Education and Research (BMBF) as part of the SysMoII program. JV is supported by the BMBF FORSYS program. We thank the anonymous reviewers for their constructive comments.
- Couzin J: Breakthrough of the year. Small RNAs make big splash. Science 2002, 298: 2296–2297. 10.1126/science.298.5602.2296View ArticlePubMedGoogle Scholar
- Storz G, Altuvia S, Wassarman KM: An abundance of RNA regulators. Annu Rev Biochem 2005, 74: 199–217. 10.1146/annurev.biochem.74.082803.133136View ArticlePubMedGoogle Scholar
- Eddy SR: Non-coding RNA genes and the modern RNA world. Nat Rev Genet 2001, 2: 919–929. 10.1038/35103511View ArticlePubMedGoogle Scholar
- Hannon GJ, Rivas FV, Murchison EP, Steitz JA: The expanding universe of noncoding RNAs. Cold Spring Harb Symp Quant Biol 2006, 71: 551–564. 10.1101/sqb.2006.71.064View ArticlePubMedGoogle Scholar
- Mercer TR, Dinger ME, Mattick JS: Long non-coding RNAs: insights into functions. Nat Rev Genet 2009, 10: 155–159. 10.1038/nrg2521View ArticlePubMedGoogle Scholar
- Machado-Lima A, Portillo HAD, Durham AM: Computational methods in noncoding RNA research. J Math Biol 2008, 56: 15–49. 10.1007/s00285-007-0122-6View ArticlePubMedGoogle Scholar
- Wold B, Myers RM: Sequence census methods for functional genomics. Nat Meth 2008, 5: 19–21. 10.1038/nmeth1157View ArticleGoogle Scholar
- Meyer IM: A practical guide to the art of RNA gene prediction. Brief Bioinform 2007, 8: 396–414. 10.1093/bib/bbm011View ArticlePubMedGoogle Scholar
- Klein RJ, Misulovin Z, Eddy SR: Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc Natl Acad Sci USA 2002, 99: 7542–7547. 10.1073/pnas.112063799View ArticlePubMedPubMed CentralGoogle Scholar
- Carter RJ, Dubchak I, Holbrook SR: A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res 2001, 29: 3928–3938.PubMedPubMed CentralGoogle Scholar
- Kavanaugh LA, Dietrich FS: Non-coding RNA prediction and verification in Saccharomyces cerevisiae. PLoS Genet 2009, 5: e1000321. 10.1371/journal.pgen.1000321View ArticlePubMedPubMed CentralGoogle Scholar
- Freyhult E, Gardner PP, Moulton V: A comparison of RNA folding measures. BMC Bioinformatics 2005, 6: 241. 10.1186/1471-2105-6-241View ArticlePubMedPubMed CentralGoogle Scholar
- Wassarman KM, Repoila F, Rosenow C, Storz G, Gottesman S: Identification of novel small RNAs using comparative genomics and microarrays. Genes & Development 2001, 15: 1637–1651.View ArticleGoogle Scholar
- Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8. 10.1186/1471-2105-2-8View ArticlePubMedPubMed CentralGoogle Scholar
- Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America 2005, 102: 2454–2459. 10.1073/pnas.0409169102View ArticlePubMedPubMed CentralGoogle Scholar
- Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A: Rfam: updates to the RNA families database. Nucleic Acids Res 2009, 37: D136-D140. 10.1093/nar/gkn766
- Mathews DH: Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics 2005, 21: 2246–2253. 10.1093/bioinformatics/bti349
- Xiao B, Li W, Guo G, Li B, Liu Z, Jia K, Guo Y, Mao X, Zou Q: Identification of small noncoding RNAs in Helicobacter pylori by a bioinformatics-based approach. Curr Microbiol 2009, 58: 258–263. 10.1007/s00284-008-9318-2
- Pichon C, Felden B: Small RNA gene identification and mRNA target predictions in bacteria. Bioinformatics 2008, 24: 2807–2813. 10.1093/bioinformatics/btn560
- Solda G, Makunin IV, Sezerman OU, Corradin A, Corti G, Guffanti A: An Ariadne's thread to the identification and annotation of noncoding RNAs in eukaryotes. Brief Bioinform 2009, 10: 475–489. 10.1093/bib/bbp022
- Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101: 7287–7292. 10.1073/pnas.0401799101
- Tjaden B: Prediction of small, noncoding RNAs in bacteria using heterogeneous data. J Math Biol 2008, 56: 183–200. 10.1007/s00285-007-0079-5
- Bina M: The genome browser at UCSC for locating genes, and much more! Mol Biotechnol 2008, 38: 269–275. 10.1007/s12033-007-9019-2
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005, 15: 1451–1455. 10.1101/gr.4086505
- Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, 34: W729-W732. 10.1093/nar/gkl320
- Noirot C, Gaspin C, Schiex T, Gouzy J: LeARN: a platform for detecting, clustering and annotating non-coding RNAs. BMC Bioinformatics 2008, 9: 21. 10.1186/1471-2105-9-21
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Hofacker I, Fontana W, Stadler P, Bonhoeffer S, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem 1994, 125: 167–188. 10.1007/BF00818163
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2008, 36: D25-D30. 10.1093/nar/gkm929
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23: 2947–2948. 10.1093/bioinformatics/btm404
- Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y: Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008, 18: 310–323. 10.1101/gr.6991408
- Kingsford CL, Ayanbule K, Salzberg SL: Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol 2007, 8: R22. 10.1186/gb-2007-8-2-r22
- Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2: 526–538.
- Perez N, Treviño J, Liu Z, Ho SCM, Babitzke P, Sumby P: A Genome-Wide Analysis of Small Regulatory RNAs in the Human Pathogen Group A Streptococcus. PLoS ONE 2009, 4: e7668. 10.1371/journal.pone.0007668
- Wang C, Ding C, Meraz RF, Holbrook SR: PSoL: a positive sample only learning algorithm for finding non-coding RNA genes. Bioinformatics 2006, 22: 2590–2596. 10.1093/bioinformatics/btl441
- Toledo-Arana A, Dussurget O, Nikitas G, Sesto N, Guet-Revillet H, Balestrino D, Loh E, Gripenland J, Tiensuu T, Vaitkevicius K, Barthelemy M, Vergassola M, Nahori M, Soubigou G, Regnault B, Coppee J, Lecuit M, Johansson J, Cossart P: The Listeria transcriptional landscape from saprophytism to virulence. Nature 2009, 459: 950–956. 10.1038/nature08080
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.