Non-coding RNA detection methods combined to improve usability, reproducibility and precision
© Raasch et al. 2010
Received: 10 May 2010
Accepted: 29 September 2010
Published: 29 September 2010
Non-coding RNAs gain more attention as their diverse roles in many cellular processes are discovered. At the same time, the need for efficient computational prediction of ncRNAs increases with the pace of sequencing technology. Existing tools are based on various approaches and techniques, but none of them provides a reliable ncRNA detector yet. Consequently, a natural approach is to combine existing tools. Due to a lack of standard input and output formats, combining and comparing existing tools is difficult. Also, for genomic scans they often need to be incorporated in detection workflows using custom scripts, which decreases transparency and reproducibility.
We developed a Java-based framework to integrate existing tools and methods for ncRNA detection. This framework enables users to construct transparent detection workflows and to combine and compare different methods efficiently. We demonstrate the effectiveness of combining detection methods in case studies with the small genomes of Escherichia coli, Listeria monocytogenes and Streptococcus pyogenes. With the combined method, we gained 10% to 20% precision for sensitivities from 30% to 80%. Further, we investigated Streptococcus pyogenes for novel ncRNAs. Using multiple methods--integrated by our framework--we determined four highly probable candidates. We verified all four candidates experimentally using RT-PCR.
We have created an extensible framework for practical, transparent and reproducible combination and comparison of ncRNA detection methods. We have proven the effectiveness of this approach in tests and by guiding experiments to find new ncRNAs. The software is freely available under the GNU General Public License (GPL), version 3 at http://www.sbi.uni-rostock.de/moses along with source code, screen shots, examples and tutorial material.
Non-coding RNAs have drawn much attention in the last couple of years, after being neglected for a long time. They are now known to play key roles in diverse cellular processes such as regulation of gene expression, splicing and directing chemical modifications [2, 3]. Functional categorization of RNAs is not yet complete as new functions are discovered continuously [4, 5].
Detection of non-coding RNA genes in genomic sequences is an urgent but unsolved problem in bioinformatics. The accelerated pace of sequencing technology further increases the need for reliable identification of ncRNAs. The main approaches to computational prediction of ncRNAs are compositional analysis, secondary structure prediction, structural or sequence-based homology and the use of promoters and terminator signals. Numerous tools following one of these approaches or combinations thereof exist [6, 8].
Compositional analysis can be a simple scan for local GC-content, an approach successful in AT-rich hyperthermophiles. Considering more compositional features in a machine learning approach has also shown success. Based on the fact that functional RNAs rely on a defined secondary structure, prediction of transcript minimum free energy is used as a means for detecting ncRNA genes. Freyhult examined different quantities that can be used for this approach. Sequence-based homology can be used for detection if reference genomes with appropriate evolutionary distances are available.
Successful tools such as QRNA and RNAz combine secondary structure prediction with a homology approach relying on multiple alignments. The most comprehensive RNA family database, RFAM, uses covariance models combining structural and sequence conservation to establish RNA families. The covariance model can be used to find new members of existing families, albeit at considerable computational cost. Dynalign uses an approximation of Sankoff's algorithm for structural alignment of two RNAs.
Xiao et al. used promoter and terminator prediction in intergenic regions aided by conservation and secondary structure analysis to predict ncRNAs.
To achieve better accuracy, some tools limit the scope to specific ncRNA families such as tRNA, miRNA and snoRNA.
However, none of the available tools for general ncRNA detection has reached a level of reliability comparable to protein-gene detection software. In contrast to ncRNA genes, protein genes exhibit codon-bias, open reading frames and strong sequence conservation, simplifying their detection. Since the diverse methods for ncRNA detection are complementary, a practical approach is to combine the available methods, as suggested by recent reviews [6, 8, 19, 20]. Meyer et al. also remarked that many ncRNA detection methods rest on the assumption of a significant secondary structure, which may not always be necessary for an ncRNA to function. Consequently, even the more successful methods, which rely on this assumption, need to be complemented with others to achieve more comprehensive predictions.
The combination of methods allows for precise predictions by using candidates that are predicted by several methods, or finding more candidates by using predictions from all methods. If the combination is done under a well designed framework, reproducibility, transparency and comparison of predictions are improved as well.
Previous efforts for the integration of data and algorithms in genomic research exist: RNAStructure integrated secondary structure prediction and structure-based homology analysis but is not easily extended and not readily usable for genomic scans. Tools such as sRNAfinder combine several approaches to improve prediction results, but in a predefined way. The UCSC genome browser offers a huge amount of experimental data, pre-calculated predictions and analyses for a selected number of genomes. Basic functions for comparative genomics are available, extended by an interface to Galaxy. Galaxy is a project that also aims to overcome custom and redundant scripting for bioinformatics tasks in genomic research, but does not yet offer specialized tools for ncRNA prediction. TAVERNA is a powerful all-purpose framework, but its primary source of functionality, "BioCatalogue", does not yet contain essential ncRNA related tools such as RNAz and Dynalign. LeARN is an extensible framework for annotating newly sequenced genomes, but it is more focused on processing trusted results from detection tools than on improving predictions by combining analyses from different algorithms. Consequently, there is a need for a framework that is easy to use and specialized for non-coding RNA detection. The main goals of our project are:
Combination: Improving ncRNA detection by combining existing methods.
Comparison: Easy comparison of the prediction performance of different methods must be possible.
Reproducibility: Application, combination and comparison of methods must be performed in a reproducible and transparent way.
Usability: User experience should be improved by a GUI and visualization of all workflow steps and their respective results. No programming should be required to construct workflows, and to combine and compare methods.
Our software is aimed at three user groups: First, for bioinformaticians, the use and the combination of integrated tools must be simple. Second, developers of new algorithms for ncRNA detection must be provided with a ready-to-use environment and test bed. This removes the need to re-program solutions for tasks such as parsing files or visualization. Third, biologists must be able to re-use tested methods easily.
The implementation presented here supports compositional analysis, sequence-based homology (BLAST), sequence and structural homology (RNAz and Dynalign) and secondary structure prediction (using RNAfold). Our tool can easily be extended through an open architecture.
In the next section we show how moses was designed to fulfill the given goals. In case studies we then demonstrate the effectiveness of combining methods: depending on how the methods are combined, either precision or sensitivity is increased. Furthermore, our framework has been successfully applied to guide experiments in Streptococcus pyogenes to find new ncRNAs.
An advantage of our modular approach is that it provides a good trade-off between flexibility and complexity: The user constructs workflows simply by chaining modules together and providing the parameters needed for their calculations. The output of every module can serve as the input for every other module. This allows for a free combination of modules while not requiring any programming skills.
The modules used to construct workflows can contain external tools, directly implemented analysis methods or helper functions. Each module represents one step in an analysis workflow. In the case of external tools, the module converts the input data, runs the tool, parses the output and converts it back into the moses format to ensure compatibility between all modules. Converting to one common data exchange format is more efficient than converting input and output between different tools, even though this is common practice in bioinformatics using custom scripts. The format we chose is a matrix of float values. Columns in the matrix correspond to nucleotide positions. Rows can hold different kinds of information, for instance ncRNA probability scores from several detection methods.
This basic format is very simple and yet can hold all types of information needed for the purpose of ncRNA prediction. The modules can be written using a data structure that is familiar to most programmers.
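The matrix layout described above can be illustrated with a minimal sketch. It is not the moses file format or API: the function name, the track names and the use of plain Python lists are assumptions made only for illustration. Columns are nucleotide positions and each row holds one kind of score.

```python
from typing import Dict, List, Tuple

def build_profile_matrix(
    genome_length: int, tracks: Dict[str, Dict[int, float]]
) -> Tuple[List[str], List[List[float]]]:
    """Return (row_names, matrix); matrix[r][c] is the score of row r at position c.

    Hypothetical helper: positions without a score are filled with 0.0.
    """
    row_names = sorted(tracks)
    matrix = [[0.0] * genome_length for _ in row_names]
    for r, name in enumerate(row_names):
        for pos, score in tracks[name].items():
            if 0 <= pos < genome_length:
                matrix[r][pos] = float(score)
    return row_names, matrix

# Two illustrative score tracks over a six-nucleotide sequence.
row_names, matrix = build_profile_matrix(
    6, {"rnaz": {2: 0.99}, "rnafold_z": {2: 4.7, 3: 5.1}}
)
```

Because every track shares the same column index, results from different methods can be compared or combined position by position without any format conversion.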
The parameters needed to run a module are saved in human readable format in the corresponding moses-file along with IDs identifying the source modules. This creates a structure of dependent calculations that form a detection workflow. Individual modules of this workflow can be exchanged to modify and re-use the workflow. For example, the modules holding the analysed genome can be exchanged to perform an identical scan on a different species.
The key modules moses provides are BLAST, word frequency analysis (typically used for GC-content analysis), RNAfold, RNAz (using ClustalW for alignments), Dynalign and calculation of DNA properties, such as base stacking energy or bendability. BLAST can be used to compare two genomes, scan a genome for occurrences of a query sequence or locate conserved regions of a genome by BLASTing against a local database created with the BLAST helper tool formatDB. The RNAfold module uses a sliding window approach. For each sliding window the minimum free energy structure is predicted, and the corresponding minimum free energy value is stored at the centre position of the respective window. This results in a numerical profile aligned with the genome's base pairs.
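The sliding window scheme can be sketched as follows. This is an illustration only, not the moses implementation: the function names are hypothetical, and a simple GC fraction stands in for the per-window score, since calling RNAfold itself is outside the scope of the sketch.

```python
from typing import Callable, List

def sliding_window_profile(
    seq: str, window: int, step: int, score: Callable[[str], float]
) -> List[float]:
    """Score every window and store the value at the window's centre position.

    Positions that are not the centre of any window keep the value 0.0.
    """
    profile = [0.0] * len(seq)
    for start in range(0, len(seq) - window + 1, step):
        centre = start + window // 2
        profile[centre] = score(seq[start:start + window])
    return profile

def gc_fraction(window_seq: str) -> float:
    """Stand-in scorer (assumption): fraction of G and C in the window."""
    return (window_seq.count("G") + window_seq.count("C")) / len(window_seq)

profile = sliding_window_profile("ATGCGCAT", window=5, step=1, score=gc_fraction)
```

Any per-window quantity, such as a minimum free energy value, could be passed in place of `gc_fraction`; the resulting profile stays aligned with the sequence positions.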
The RNAz module scans a genome for ncRNA, requiring the output of several BLAST modules. Again, a sliding window approach is used. For each window, the most similar regions in reference genomes, as detected by BLAST, are used to construct a multiple alignment using ClustalW. This alignment is then analysed for ncRNA by RNAz. Finally the RNAz-score (called "RNA class probability") is stored at the window's centre position.
Similar to the RNAz module, the Dynalign module uses a sliding window and relies on BLAST modules to find the most similar regions in reference genomes for each window. A structural alignment of the analysed window and the region with the best BLAST-score is calculated using Dynalign. The output for each window is the alignment score.
The DNA properties module is similar to the RNAfold module: it calculates a numerical profile corresponding to a certain physical property of a DNA subsequence. The work of Abeel et al. shows that profiles of thermodynamic DNA properties can be used to detect transcriptional signals. Those signals not pointing to known protein genes may be indications of ncRNA genes. DNA base stacking energy, bendability or protein induced deformability are examples of such properties. To calculate a numerical profile we use the procedure given by Abeel et al.: Each (overlapping) dinucleotide of a DNA sequence is converted into a number according to a conversion table. These values are then smoothed using a sliding window: for each window the average is calculated and stored at the centre position of the window. Parameters for the properties were taken from EP3, a promoter detection tool developed by Abeel et al. The full list of available properties can be viewed on the tool's website (http://bioinformatics.psb.ugent.be/webtools/ep3/?conversion).
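The two steps of this procedure, dinucleotide conversion followed by window averaging, can be sketched as below. The conversion table is an arbitrary placeholder (the number of G or C bases in the dinucleotide) and does not reproduce the EP3 parameters; the function name is likewise hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Placeholder table (assumption): NOT real stacking energies or EP3 values.
PLACEHOLDER_TABLE: Dict[str, float] = {
    a + b: float((a in "GC") + (b in "GC")) for a, b in product("ACGT", repeat=2)
}

def property_profile(
    seq: str, table: Dict[str, float], window: int
) -> Tuple[List[float], List[float]]:
    """Convert overlapping dinucleotides to numbers, then smooth by window average."""
    values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    profile = [0.0] * len(values)
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        # The average of each window is stored at the window's centre position.
        profile[start + window // 2] = sum(chunk) / window
    return values, profile

values, profile = property_profile("AACGT", PLACEHOLDER_TABLE, window=3)
```

Swapping in a real conversion table changes only the dictionary; the smoothing step is independent of which physical property is profiled.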
Besides the key modules, predictions from RFAM can be incorporated by BLASTing against a RFAM dump. To include terminator predictions, output from TransTermHP can be loaded as well.
All available methods can also be applied to sub-regions of a sequence. This is useful, for instance, to exclude protein gene regions from analysis or limit the calculations to a region of interest. Furthermore, moses provides a number of arithmetic and logical functions to process the result of any method.
These functions are also applied for combining the detection methods to arrive at more precise or more sensitive predictions. Usually, this involves trading precision for sensitivity or vice versa. A more precise combined prediction is achieved by considering only predictions made by more than one individual method. A more sensitive combined prediction is achieved by collecting the predictions from all individual methods. A more sophisticated combination is to use weighted scores of individual methods: Reliable methods can be weighted higher than relatively unreliable ones to get a combined result that is more precise yet retains much of the sensitivity of the individual methods.
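The three combination strategies can be sketched on binary per-position predictions. The function names, the example predictions and the weights are assumptions chosen for illustration; they are not taken from moses.

```python
from typing import List, Sequence

def combine_votes(tracks: Sequence[Sequence[int]], min_votes: int) -> List[int]:
    """Call a position positive if at least `min_votes` methods predict it.

    min_votes=1 collects all predictions (more sensitive);
    min_votes>=2 keeps only predictions shared by several methods (more precise).
    """
    return [1 if sum(column) >= min_votes else 0 for column in zip(*tracks)]

def combine_weighted(
    tracks: Sequence[Sequence[int]], weights: Sequence[float], threshold: float
) -> List[int]:
    """Call a position positive if the weighted sum of predictions reaches `threshold`."""
    return [
        1 if sum(w * v for w, v in zip(weights, column)) >= threshold else 0
        for column in zip(*tracks)
    ]

# Illustrative predictions of three methods over four positions.
method_a = [1, 1, 0, 0]
method_b = [1, 0, 1, 0]
method_c = [1, 0, 0, 0]
tracks = [method_a, method_b, method_c]

precise = combine_votes(tracks, min_votes=2)
sensitive = combine_votes(tracks, min_votes=1)
# Hypothetical weights: method_a is treated as the most reliable method.
weighted = combine_weighted(tracks, weights=[0.6, 0.3, 0.1], threshold=0.5)
```

In this toy case the weighted variant keeps the second position, which only the most trusted method predicts, while still rejecting the third position, which only a less trusted method predicts.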
The quality of any single or combined detection method can be analysed and compared using the built-in statistical evaluation, if data of known ncRNAs is available.
To make our software accessible to a wide range of users and to enhance usability, we provide a graphical user interface. Included features of the interface are:
easy access to external tools as moses modules,
constructing workflows with visualization of the modules' dependencies,
multiple visualization modes for numerical profiles, for visual inspection, comparison and detection of correlations,
browsing of genome annotations or calculated prediction signals,
statistical assessment of each method, e.g., precision, sensitivity.
Integrated visualization of all intermediate results of a workflow helps with finding mistakes, generating hypotheses and interpreting results in the context of all available information.
Construction of the test regions for the case studies
Escherichia coli str. K-12 substr. MG1655: 151 consisting of 11900 bp (5.3%); 32754 bp (14.7%); 180505 bp (81.0%)
Listeria monocytogenes EGD-e: 101 consisting of 19440 bp (6.8%); 37317 bp (13.1%); 231107 bp (81.1%)
Streptococcus pyogenes MGAS5005: 73 consisting of 14625 bp (6.9%); 37067 bp (17.5%); 163416 bp (77.0%)
Influence of the window size in Escherichia coli
Escherichia coli str. K-12 substr. MG1655
Listeria monocytogenes EGD-e
Streptococcus pyogenes MGAS5005
Enterobacter sp. 638
Klebsiella pneumoniae 342
Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
Listeria innocua Clip11262
Listeria welshimeri serovar 6b str. SLCC5334
Listeria seeligeri serovar 1/2b str. SLCC3954
Streptococcus agalactiae 2603VR
Streptococcus equi subsp. zooepidemicus str. MGCS10565
Streptococcus pyogenes M1 GAS
Streptococcus pyogenes MGAS315
Genome for new predictions: Streptococcus pyogenes NZ131
Genomes for BLAST sequence conservation analysis: see moses website
The RNAz module scans a genome for ncRNA using a sliding window. For each window, the most similar regions in the reference genomes, as detected by the BLAST modules, are aligned together with the analysed window using ClustalW. The resulting multiple alignment is then analysed by RNAz to give a so-called "RNA class probability".
In the Dynalign method, only the reference region with the highest BLAST score is used for structural alignment using Dynalign.
The RNAfold method consists of two steps: First, the minimum free energy value of the energetically optimal fold is calculated for each window. Second, the distribution of minimum free energy values for sequences of the nucleotide composition and length given by the analysed window is sampled. To this end, RNAfold calculates the minimum free energy value for shuffled versions of the original window. The shuffling method by Altschul et al. is used to preserve not only the mono- but also the dinucleotide composition, because the secondary structure prediction is especially sensitive to the dinucleotide composition. For our tests we used 100 shuffled versions. Mean and standard deviation are obtained from the sampled distribution to estimate the significance of the actual minimum free energy value. The final output of the RNAfold method is the Z-score for each window. The Z-score is the difference between the value of the original window and the mean, expressed in standard deviations. The RNAfold module inverts the sign of the minimum free energy values for convenience. The RNAfold method and the RNAz method are closely related, but RNAz does not use the Z-score of the original sequence; rather, it uses averages of the Z-scores from all sequences in the alignment. Our results show that the pure Z-score as used by Kavanaugh et al. is useful for ncRNA prediction; however, the way RNAz approximates Z-scores is orders of magnitude faster and of practically the same accuracy.
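The Z-score step can be sketched as follows, assuming the minimum free energy values of the original window and of its shuffled versions have already been computed. The shuffling and the folding are not reproduced here. The function name and the choice of the sample standard deviation are assumptions; the text does not state which estimator is used.

```python
import statistics
from typing import Sequence

def mfe_z_score(mfe_original: float, mfe_shuffled: Sequence[float]) -> float:
    """Z-score of a window's MFE against the MFEs of its shuffled versions.

    Signs are inverted first, so a more stable (more negative) original
    fold yields a larger positive Z-score.
    """
    inverted = [-value for value in mfe_shuffled]
    mean = statistics.mean(inverted)
    # Assumption: sample standard deviation (needs at least two shuffled values).
    sd = statistics.stdev(inverted)
    if sd == 0:
        return 0.0
    return (-mfe_original - mean) / sd

# Illustrative energies in kcal/mol; four shuffles here instead of 100.
z = mfe_z_score(-20.0, [-10.0, -12.0, -8.0, -10.0])
```

With the sign inversion, a threshold on the Z-score selects windows whose fold is unusually stable compared with sequences of the same composition.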
For the integrated tools BLAST, ClustalW and RNAz default parameters are used. Graphical output of predicted structures is suppressed for RNAfold to save computation time. TransTermHP predictions for Listeria monocytogenes and Streptococcus pyogenes were downloaded from the TransTermHP website (http://transterm.cbcb.umd.edu/); predictions for Escherichia coli were performed with the downloaded program using default parameters.
The methods were combined by applying threshold filters to the results of the RNAz, RNAfold and Dynalign methods. The thresholds were 0.995, 4.5 and 550, respectively. For TransTermHP the confidence score threshold was 70, the default value used for the pre-calculated predictions from the TransTermHP website. Based on the parameter scans that were performed, we selected values that gave intermediate precision and sensitivity for the individual methods.
After the thresholds had been applied, a "0" was stored at the centre of each window if the value was below or equal to the threshold, and a "1" if it was above. The values for those three methods were added; additionally, a "1" was added for each base pair of a predicted terminator.
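The thresholding and summation can be sketched as below, using the three thresholds quoted above. The function name, the dictionary keys and the example scores are hypothetical, and the terminator track is assumed to be given already as a 0/1 mask per base pair.

```python
from typing import Dict, List, Sequence

# Thresholds as stated in the text; a value must be strictly above to count.
THRESHOLDS: Dict[str, float] = {"rnaz": 0.995, "rnafold": 4.5, "dynalign": 550.0}

def combined_score(
    scores: Dict[str, Sequence[float]], terminator_mask: Sequence[int]
) -> List[int]:
    """Sum per-position votes of the thresholded methods plus the terminator mask."""
    total = list(terminator_mask)
    for name, threshold in THRESHOLDS.items():
        for i, value in enumerate(scores[name]):
            total[i] += 1 if value > threshold else 0
    return total

# Illustrative scores at three positions (made up for this sketch).
scores = {
    "rnaz": [0.999, 0.995, 0.2],
    "rnafold": [5.0, 4.5, 1.0],
    "dynalign": [600.0, 500.0, 551.0],
}
total = combined_score(scores, terminator_mask=[0, 1, 0])
```

The result ranges from 0 to 4 at each position, so a final cut-off on this sum decides how many agreeing sources are required for a prediction.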
Sensitivity and precision were calculated using the usual definitions: Sensitivity is the ratio of true positive windows to all known ncRNA-containing windows. Precision is the ratio of true positive windows to the sum of true positive and false positive windows.
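As a minimal sketch of these definitions, with hypothetical names and made-up example masks, window-based sensitivity and precision can be computed from two 0/1 lists of equal length:

```python
from typing import Sequence, Tuple

def window_metrics(predicted: Sequence[int], known: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, precision) for per-window 0/1 predictions and annotations."""
    tp = sum(1 for p, k in zip(predicted, known) if p and k)
    fp = sum(1 for p, k in zip(predicted, known) if p and not k)
    fn = sum(1 for p, k in zip(predicted, known) if not p and k)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision

# Five windows: two true positives, one false positive, one false negative.
sensitivity, precision = window_metrics([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```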
While window-based precision and sensitivity are good to compare different methods, they do not reflect the practical value of predictions that are to be used to guide experimental verification of candidates. In practice, several windows next to each other that are predicted to contain ncRNA will be seen as one predicted locus or signal (for our purposes we want to neglect gaps). Those signals will then be used to guide experiments instead of each individual window.
Therefore, we define signal precision as the ratio of signals that overlap known ncRNA to all signals as an analogue to precision, and we define signal sensitivity as the ratio of known ncRNA that overlap a signal to all known ncRNA as an analogue to sensitivity. The signal-based figures can be misleading if used alone, as overly long signals will yield high signal precision and signal sensitivity without being specific enough for experiments. In order to check the quality of the predictions, we also calculated the false positive rate, defined as the ratio of false positive windows to all windows known to not contain ncRNA. Figure 3 shows the prediction quality of the four individual methods and the combined method for all species in terms of window-based and signal-based figures as well as the false positive rates.
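The signal-based figures and the false positive rate can be sketched as follows. Several choices here are assumptions rather than details from the text: signals are formed from strictly adjacent positive windows only (no gap merging), intervals are half-open, and all names and example data are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end)

def runs_of_ones(mask: Sequence[int]) -> List[Interval]:
    """Group adjacent positive windows into signals."""
    runs, start = [], None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs

def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def signal_metrics(predicted: Sequence[int], genes: Sequence[Interval]) -> Tuple[float, float]:
    """Return (signal precision, signal sensitivity)."""
    signals = runs_of_ones(predicted)
    hit_signals = sum(1 for s in signals if any(overlaps(s, g) for g in genes))
    hit_genes = sum(1 for g in genes if any(overlaps(s, g) for s in signals))
    precision = hit_signals / len(signals) if signals else 0.0
    sensitivity = hit_genes / len(genes) if genes else 0.0
    return precision, sensitivity

def false_positive_rate(predicted: Sequence[int], known: Sequence[int]) -> float:
    """False positive windows divided by all windows known to contain no ncRNA."""
    fp = sum(1 for p, k in zip(predicted, known) if p and not k)
    negatives = sum(1 for k in known if not k)
    return fp / negatives if negatives else 0.0

predicted = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
genes = [(1, 3), (8, 10)]
known = [0, 1, 1, 0, 0, 0, 0, 0, 1, 1]
signal_precision, signal_sensitivity = signal_metrics(predicted, genes)
fpr = false_positive_rate(predicted, known)
```

In this toy example one of three signals overlaps a known gene and one of two genes is hit, while four of the six gene-free windows are falsely predicted, which shows why the signal-based figures are best read together with the false positive rate.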
The plots reveal that for a wide sensitivity range the combined method largely improves the "signal-precision" by 15% and the window-based precision by about 10% across the three tested species. The improvements are confirmed by the reduced false positive rate visible in Figure 3, subfigures 3a-c.
Our tests show that our software allows for a flexible and easy combination of ncRNA detection methods and that the combination improves detection results. Methods can easily be compared using the available statistics.
To show that moses can successfully guide experimental detection of ncRNA we present predictions for Streptococcus pyogenes NZ131, a human pathogen. We used four methods in a genomic scan to minimize false positives. As we have seen in the case studies, which were based on automated workflows, even the improved methods suffer from relatively high false positive rates. To arrive at a candidate list that had the most potential to be true ncRNA genes--in order to minimize unsuccessful experiments--we used manual inspection of multiple data sources instead.
All calculated data and the parameters used are available on the moses website.
The data sources were RNAfold secondary structure predictions, calculated DNA base stacking energy, BLAST-calculated conservation against related genomes and RNAz-predictions. The RNAfold module was used with window size 41, the DNA properties module with window size 81, step size 1 base pair in both cases. For RNAz the window size was 41 with step size 5 base pairs. The calculations were performed on the full genome sequence.
We examined the characteristic RNAfold and DNA base stacking energy profiles around known ncRNA genes to manually distinguish them from genomic background.
Also, isolated conserved spots were considered as clues for potential genes. Conservation was determined by BLASTing the NZ131 genome against all pyogenes serotypes in one module and against a selection of Streptococcus genomes in a second. Intergenic regions were examined for these four clues. Data used for the visual inspection is available on the moses website.
Criteria for visual inspection of intergenic regions in Streptococcus pyogenes NZ131
1) BLAST vs Streptococci
2) BLAST vs S. pyogenes
3) DNA base stacking energy
The combination of multiple methods, possible in moses, has yielded highly probable candidates. RNAz or BLAST alone, for instance, would have given us hundreds of candidate loci to examine (data on the moses website).
Our predictions demonstrate that the integrated approach possible with moses is able to guide experimental detection of new ncRNAs. The RT-PCR experiments are not sufficient to rule out that the observations are related to neighbouring transcripts rather than true ncRNA. Accordingly, further experiments are in progress to confirm and characterize the four new candidate ncRNAs and their function in Streptococcus pyogenes physiology and virulence.
The computationally more demanding algorithms, RNAz analysis and RNAfold secondary structure prediction with shuffled comparison sequences, need approximately 120 hours for full prokaryotic genomes (assuming an average size of 4 Mb) on standard workstation computers. Dynalign takes even longer because it performs full structural alignments.
Our tests were performed on a machine with Intel(R) Core(TM) 2 Duo CPU 2.66 GHz with 2 GB of RAM with the use of parallelization. The RNAfold method and the RNAz method both took three hours for 99,950 analysed windows (for an analysed genome, the number of base pairs minus the window size plus one equals the number of windows to analyse). The window size for both methods was 51. The window step was one base pair to obtain maximum resolution. However, this is not to imply that we can predict the exact gene starts and ends. For the RNAfold method 100 shuffled windows were used, and three reference genomes were used for the RNAz method. If the analysis is limited to the intergenic regions, the time is reduced depending on the percentage of coding regions of the genomes under examination (often, intergenic regions constitute 10% of prokaryotic genomes). Other ways to avoid overly long calculations include choosing a larger step size and parallel calculations by dividing analysed genomes into smaller parts.
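The window count rule stated above can be written as a small helper. The generalization to step sizes larger than one base pair is an assumption added for illustration, as is the function name.

```python
def count_windows(length: int, window: int, step: int = 1) -> int:
    """Number of windows of size `window` that fit into `length` positions.

    For step=1 this is length - window + 1, as stated in the text.
    """
    if length < window:
        return 0
    return (length - window) // step + 1

# With window size 51, a 100,000 bp sequence gives 99,950 windows at step 1.
windows_step_1 = count_windows(100_000, 51)
# A step of 5 base pairs reduces the number of windows roughly five-fold.
windows_step_5 = count_windows(100_000, 51, step=5)
```

Reading the figure quoted above as 99,950 windows, it is consistent with a scanned sequence of 100,000 bp at window size 51, and the second call illustrates how a larger step size shortens the calculation.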
The methods used here are in principle not restricted to prokaryotes; the limiting factor is the sheer size of the genome.
We developed a framework for reproducible, transparent and easy combination of existing ncRNA detection methods. Our contribution helps to satisfy the need for a combined approach as suggested by recent reviews [6, 8, 19, 20]. The main improvements our framework provides are:
Wrapping existing tools and methods in moses modules that convert input and output formats to a common data interchange format makes combination possible. We have demonstrated the effectiveness of combining methods in tests on Escherichia coli, Listeria monocytogenes and Streptococcus pyogenes. Further, we predicted novel ncRNAs in Streptococcus pyogenes using multiple methods to yield highly probable candidates, thereby reducing unsuccessful experiments. Final confirmation and subsequent characterisation of the candidates is in progress.
The methods used can readily be compared using the integrated accuracy report. Statistical figures such as signal-based and by-base-pair precision and sensitivity are readily at hand. This allows effective evaluation of existing methods and the selection of appropriate methods, e.g., according to available reference genomes or given taxon.
Workflows are self-documented as all parameters and data dependencies are stored in the moses files. This means the workflows are transparent because no hidden conversions and no implied functions are performed, only the ones defined by the user. No custom scripts or custom in-house software are involved in studies carried out in moses. Furthermore, an existing workflow can be reused on different sequences, different data or altered parameters.
We created a GUI and visualization for all intermediate steps of a workflow. This makes it possible to detect flaws in a workflow and helps to interpret the results. The integrated environment supports hypothesis generation and brings data and results in context with all available information.
The method of constructing workflows in moses is easy as it requires no programming and no scripting. This makes it an attractive tool for bioinformaticians. Extending the framework with new algorithms is made easy through an open architecture with a plug-in mechanism. Programming effort is thus minimized and developers of new algorithms are provided with a ready-to-use platform. Biologists can easily reuse existing workflows.
The next step in the development of our framework is the integration of further existing methods and algorithms. Combination of methods could be enhanced by including support for SVM training and classification. Possibly, in the course of adding more tools the scope could be expanded to not only find ncRNA genes but protein genes, promoters, terminators and transcription factor binding sites as well. The result would be a complete picture of a genome under one common framework.
A recent approach to detect regulatory regions is pattern recognition in profiles of physical properties of the DNA, see for instance . As our framework offers different sources for such profiles, not only based on physical properties, it is a natural extension of our work to apply pattern detection to the profiles calculated by moses.
The work of BK and NP was supported by a BMBF grant in the framework of the ERA-Net PathoGenoMics 2 program (FKZ 0315437B). The authors would like to thank Jana Normann for expert technical assistance and Alexander Raasch for helping with the Java implementation and preparation of figures. Ulf Liebal and Sarah Zaatreh helped with proofreading the manuscript.
The work of OW, JV, US and PR is funded by the University of Rostock. OW and US were also supported by the Deutsche Forschungsgemeinschaft (DFG), project WO 991/4-1 and by the German Federal Ministry for Education and Research (BMBF) as part of the SysMoII program. JV is supported by the BMBF FORSYS program. We thank the anonymous reviewers for their constructive comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.