An automated pipeline named StructRNAfinder
To identify potential noncoding RNAs (ncRNAs) in genome or transcriptome sequences, research groups have to run several programs manually, each generating intermediate files in a different format. StructRNAfinder automates this laborious workflow and processes the data produced by each tool, allowing non-bioinformaticians to identify and compare ncRNAs through primary sequence and secondary structure inference. All files generated along the workflow are displayed in user-friendly reports and made available for downstream analyses. StructRNAfinder uses Infernal [14] to assign genome- or transcriptome-derived sequences to the corresponding RNA families. For data derived from next generation sequencing (NGS) studies, the final set of assembled sequences is required as input. All sequences are then compared against covariance models, which represent the sequence and structure consensus alignment of each RNA family reported to date in the Rfam database [15].
One issue that arises when comparing sequences against covariance models is that current tools report results only as plain text, which contains the sequence-structure alignments, their positional coordinates and their statistics. No annotation of the predicted RNA families is provided: there are no images of the potential RNA secondary structure, no description of it in the standard dot-bracket format, and no functional description. StructRNAfinder automatically parses the Infernal alignment output, filtering and extracting the significant hits for each sequence/covariance model comparison. Based on the mapping coordinates of each hit, the tool calculates the length of the alignment, the size of the input sequence and the size of the target RNA family. If necessary, the hit is extended along the input sequence to obtain a mature sequence similar in size to the original Rfam secondary structure, and this sequence is used as input to RNAfold for secondary structure prediction. RNAfold is part of the Vienna package [3], a widely used suite of tools for analysing RNA structures. The extension ensures the length needed to estimate the minimum free energy, which is sequence- and length-dependent [18]. In the final structure, the region assigned to the alignment is highlighted in green. The resulting image is a visual representation of the predicted structure and can be compared with the structure originally generated by Rfam, which is also available in the final report. The text representations of the structures derived from RNAfold and from the CM alignment are also reported. Once an RNA family is assigned, StructRNAfinder automatically retrieves all annotation available in the Rfam database for that RNA (i.e. family description, gene ontology, taxonomy and family secondary structure image). The general procedure performed by StructRNAfinder is shown in Fig. 1a.
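The hit extension step can be illustrated with a minimal Python sketch. It is not the tool's own code (StructRNAfinder is implemented with in-house Perl scripts), and the text does not say how the extra length is distributed. The function name `expand_hit`, the symmetric extension and the clipping at the sequence ends are all assumptions made for illustration.

```python
def expand_hit(start, end, seq_len, family_len):
    """Widen a hit to roughly the length of the Rfam family.

    Coordinates are 1-based and inclusive, with start <= end.
    Assumption (not stated in the paper): the missing length is added
    symmetrically on both sides and clipped at the sequence boundaries.
    """
    if start > end:
        raise ValueError("expected start <= end")
    hit_len = end - start + 1
    missing = family_len - hit_len
    if missing <= 0:
        # The hit already covers at least the family length.
        return start, end

    left = missing // 2
    right = missing - left
    new_start = max(1, start - left)
    new_end = min(seq_len, end + right)

    # If one side was clipped, try to recover the shortfall on the other side.
    shortfall = family_len - (new_end - new_start + 1)
    if shortfall > 0:
        if new_start == 1:
            new_end = min(seq_len, new_end + shortfall)
        else:
            new_start = max(1, new_start - shortfall)
    return new_start, new_end


# A 50-nt hit extended to an 80-nt family inside a 1000-nt sequence.
print(expand_hit(100, 149, 1000, 80))
```

Under these assumptions the extended region is the one that would be passed to RNAfold, while the original hit coordinates mark the region to highlight in the drawn structure.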
Comprehensive reports
The reports generated by StructRNAfinder contain the annotation and statistics for all RNA families, secondary structures, alignments, functions and taxonomic assignments identified in the input sequence(s). These reports are provided in HTML format and contain tables and figures that can be used for further data exploration. For instance, the index.html file (available in the main folder of the stand-alone version) displays a table of all significant hits obtained from the alignment between covariance models and input sequences (Fig. 1b). The menu to the left of the table is generated dynamically from the results and allows quick navigation through the RNA families identified in the input sequences. Clicking the hyperlink associated with a hit identifier opens a new page with the complete information for the predicted RNA (Fig. 1c): the full description of the RNA function (if available), the associated gene ontology, the covariance model alignment, the secondary structure predicted by RNAfold [3] and the secondary structure reported for the reference RNA in Rfam. In short, this page combines information extracted from Infernal, RNAfold and the Rfam database with information generated by our in-house Perl scripts. A general overview of statistics and a graphic representation of the predicted RNA families (Fig. 2a, b) are accessible in the Summary section.
Visual distribution of predicted RNA families
Users can visualize the location of each predicted RNA along the nucleotide query sequence. This can be useful to identify potential RNA clusters generated from a single precursor RNA or to obtain a genome-wide view of predicted RNAs in a whole or partial genome sequence. The Loci distribution section provides a visual representation of all RNA families identified along the analysed nucleotide sequence. If more than one sequence is used as input, this page provides one image per analysed sequence, showing the general distribution of RNA families.
Taxonomic distribution and visualization
StructRNAfinder retrieves the taxonomy of each predicted RNA family from its Rfam annotation. We used Krona [17] to generate interactive graphics that show the abundance of all RNAs belonging to different taxonomic groups, based on Rfam species annotations. In the Rfam database, each RNA family is the result of a multiple sequence alignment of sequences from different species. StructRNAfinder summarizes and plots the presence of all predicted RNAs across the three domains of life, plus viruses (Fig. 2c). For instance, the graphic in Fig. 2c shows the taxonomic distribution of 488 RNA families predicted in the E. coli strain K-12 substr. MG1655 genome sequence (accession number: U00096). The graphic is dynamic, allowing navigation through the number of predicted RNAs in each evolutionary branch. In this example, 53 RNA families (11% of the total) are present exclusively in the Bacteria domain (light blue in Fig. 2c), while 413 RNA families (85% of the total) are shared between Bacteria and other evolutionary branches (light red in Fig. 2c).
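The exclusive/shared split quoted above can be reproduced with a short Python sketch. It assumes each predicted family has already been mapped to the set of domains found in its Rfam species annotation. The function names and the toy family labels are hypothetical, and this is not how Krona or StructRNAfinder compute the plot internally.

```python
def count_domain_families(family_domains, domain="Bacteria"):
    """Count families found only in `domain` and families shared with others.

    `family_domains` maps a family name to the set of domains in which
    the family has annotated members (hypothetical input structure).
    Returns a tuple (exclusive, shared).
    """
    exclusive = 0
    shared = 0
    for domains in family_domains.values():
        if domain not in domains:
            continue
        if len(domains) == 1:
            exclusive += 1
        else:
            shared += 1
    return exclusive, shared


def percent(part, total):
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total


# Toy input with hypothetical family labels.
toy = {
    "famA": {"Bacteria"},
    "famB": {"Bacteria", "Archaea", "Eukaryota"},
    "famC": {"Eukaryota"},
}
print(count_domain_families(toy))

# Percentages for the E. coli example: 53 and 413 out of 488 families.
print(round(percent(53, 488)), round(percent(413, 488)))
```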
Output files for downstream analysis
StructRNAfinder generates several files in different formats. They can be used in downstream analyses or to supplement the information in the HTML reports produced by the tool. All files can be accessed in the Files section. The standard outputs of Infernal and RNAfold are available, together with other files generated by StructRNAfinder. Output files include: (i) a BED file containing the positional coordinates of the predicted structures on the nucleotide sequences used as input; (ii) a FASTA file containing the nucleotide sequences of the predicted RNAs; (iii) a tabular annotation file with extensive information generated by StructRNAfinder. This annotation file contains the RNA family name, the RNA class, the Rfam database identifiers, the score and e-value of each prediction according to Infernal, the folding energy according to RNAfold, the start and end positions of each prediction on the query sequence and, finally, a functional description of the predicted RNA.
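As an illustration of the BED output, the Python sketch below converts one hit into a six-column BED line. BED coordinates are 0-based and half-open, so the start is shifted by one. The sketch assumes 1-based inclusive hit coordinates in which a start greater than the end denotes the reverse strand, as Infernal reports them. The function name `hit_to_bed`, the choice of columns and the placeholder score of 0 are assumptions; the exact columns written by StructRNAfinder are not described here.

```python
def hit_to_bed(seq_id, seq_from, seq_to, family):
    """Format one hit as a tab-separated BED6 line.

    Input coordinates are 1-based and inclusive; seq_from > seq_to is
    taken to mean a hit on the reverse strand (assumption stated above).
    """
    strand = "+" if seq_from <= seq_to else "-"
    low = min(seq_from, seq_to)
    high = max(seq_from, seq_to)
    # BED: chromStart is 0-based, chromEnd is exclusive.
    fields = [seq_id, str(low - 1), str(high), family, "0", strand]
    return "\t".join(fields)


print(hit_to_bed("U00096", 100, 200, "tRNA"))
print(hit_to_bed("U00096", 200, 100, "tRNA"))
```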
webStructRNAfinder: A user-friendly web server
The webStructRNAfinder server provides a job launcher interface (RUN section) where users can analyse sequences using different search methods according to their own criteria. Users only need to provide a FASTA file with one or more sequences and to fill in a small set of required parameters (Fig. 3). The parameters are: (i) the Infernal search method: cmsearch, which searches the covariance models against a database composed of the input sequences, or cmscan, which searches the input sequences against a database composed of the covariance models (this difference influences the e-value calculation, because the e-value depends mainly on the database size); (ii) the cutoff filter applied by Infernal, based on e-value, score, or one of the three covariance model-specific reporting thresholds (gathering, noise or trusted); (iii) whether the report should include all significant hits under the selected e-value/score/CM threshold or only the best hit per sequence, that is, the one with the lowest e-value; (iv) whether the search is performed on both strands or only on one. Once StructRNAfinder finishes the analysis, the results remain available at the provided hyperlink for 48 h. In the Files section, users can download a zip archive containing all output files and HTML pages generated by the tool.
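The reporting option in (iii) can be sketched in Python as follows. The `Hit` record and the `select_hits` function are hypothetical names used only for illustration, and the sketch covers the e-value cutoff only, not the score or CM-specific thresholds. Hits at or below the cutoff are treated as significant, which is an assumption about how the boundary is handled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    seq_id: str    # input sequence identifier
    family: str    # RNA family assigned to the hit
    evalue: float  # e-value of the hit
    score: float   # bit score of the hit


def select_hits(hits, max_evalue=0.01, best_only=False):
    """Keep significant hits; optionally only the best one per sequence.

    A hit is kept when its e-value is at or below `max_evalue`.
    With best_only=True, the hit with the lowest e-value is kept
    for each input sequence.
    """
    significant = [h for h in hits if h.evalue <= max_evalue]
    if not best_only:
        return significant
    best = {}
    for h in significant:
        current = best.get(h.seq_id)
        if current is None or h.evalue < current.evalue:
            best[h.seq_id] = h
    return list(best.values())


# Toy hits with made-up values.
hits = [
    Hit("seq1", "tRNA", 1e-10, 50.0),
    Hit("seq1", "5S_rRNA", 1e-3, 20.0),
    Hit("seq2", "RNaseP_bact_a", 0.5, 10.0),
]
print(select_hits(hits))
print(select_hits(hits, best_only=True))
```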
Example StructRNAfinder reports are available in the RUN section: one with RNA families predicted in the genome of the eukaryotic human pathogen Leishmania braziliensis (Additional file 2); another with RNA families predicted in the E. coli str. K-12 genome (Fig. 2), both obtained with the Infernal cmsearch method and filtered at an e-value of 0.01; and a third with RNA families predicted in a dataset of experimentally verified ncRNA sequences taken from Sætrom and collaborators [19]. This last analysis correctly predicted 151 of the 154 experimentally validated RNAs (98.05%) (Additional file 3).