An enhanced computational platform for investigating the roles of regulatory RNA and for identifying functional RNA motifs

Background: Functional RNA molecules participate in numerous biological processes, ranging from gene regulation to protein synthesis. Analysis of functional RNA motifs and elements in RNA sequences can yield useful information for deciphering RNA regulatory mechanisms. Our previous work, RegRNA, is widely used for identifying regulatory motifs, and this work extends it by incorporating more comprehensive and up-to-date data sources and analytical approaches into a new platform.

Methods and results: An integrated web-based system, RegRNA 2.0, has been developed for comprehensively identifying the functional RNA motifs and sites in an input RNA sequence. Numerous data sources and analytical approaches are integrated, and several types of functional RNA motifs and sites can be identified by RegRNA 2.0: (i) splicing donor/acceptor sites; (ii) splicing regulatory motifs; (iii) polyadenylation sites; (iv) ribosome binding sites; (v) rho-independent terminators; (vi) motifs in the mRNA 5'-untranslated region (5'UTR) and 3'UTR; (vii) AU-rich elements; (viii) C-to-U editing sites; (ix) riboswitches; (x) RNA cis-regulatory elements; (xi) transcriptional regulatory motifs; (xii) user-defined motifs; (xiii) similar functional RNA sequences; (xiv) microRNA target sites; (xv) non-coding RNA hybridization sites; (xvi) long stems; (xvii) open reading frames; (xviii) related information about an RNA sequence. Users can submit an RNA sequence and obtain the prediction results through the RegRNA 2.0 web page.

Conclusions: RegRNA 2.0 is an easy-to-use web server for identifying regulatory RNA motifs and functional sites. Through its integrated, user-friendly interface, users can apply various analytical approaches and conveniently inspect the results with graphical visualization. RegRNA 2.0 is now available at http://regrna2.mbc.nctu.edu.tw.


Background
Numerous functional RNA motifs have been identified as playing significant roles in many essential biological processes, including transcriptional and post-transcriptional regulation of gene expression, control of mRNA stability, alternative splicing, and transcription termination. The biological activities of functional RNA motifs usually rely on a combination of their primary sequences and specific secondary structures, which act as target sites of RNA-binding factors or directly interact with the translation machinery [1]. For instance, riboswitches are metabolite-binding domains within specific mRNAs, and can regulate both transcription and translation by binding their corresponding targets [2,3].
Several databases have been established for collecting functional RNA molecules [1,4-14]. UTRdb [1] is a database of 5' and 3' untranslated sequences of eukaryotic mRNAs. It provides specialized information, including the presence of nucleotide sequence patterns that have been experimentally demonstrated to have functional roles; these patterns are collected in the UTRsite database. Rfam [4,5] is a comprehensive database of families of non-coding RNA (ncRNA) genes as well as cis-regulatory RNA elements. Each family is represented by a multiple sequence alignment of known and predicted representative members, annotated with a consensus base-paired secondary structure. It facilitates the identification and classification of new members of known RNA families, and provides a glimpse of the conservation of multiple ncRNA families across a wide taxonomic range. fRNAdb [6,7] is a database hosting a large collection of ncRNA sequence data from public non-coding databases, and provides related annotations, such as sequence ontology classification and source organisms. AEdb is a database of alternative exons and their properties from numerous species, and it forms the manually curated component of the alternative splicing database (ASD) [8]. The data in AEdb are gathered from literature in which these exons have been experimentally verified. The adenylate uridylate-rich elements (AREs, or AU-rich elements) mediate the rapid turnover of mRNAs encoding proteins that regulate cellular growth and the body's response to exogenous agents such as microbes and environmental stimuli. ARED [9,10] is a human AU-rich element-containing mRNA database. A 13-bp ARE pattern was computationally derived using MEME, and five clusters were generated from ARE sequences. NONCODE [11,12] is an integrated knowledge database designed for the analysis of ncRNAs. All ncRNAs in NONCODE were confirmed by manually consulting the references, and more than 80% of the data are from experiments. microRNAs (miRNAs) are small RNA molecules of approximately 22 nt that participate in post-transcriptional gene regulation and mRNA degradation by hybridizing to miRNA target sites. miRBase [13] is the central online repository for miRNA nomenclature, sequence data, annotation and target prediction. It provides a range of data to facilitate studies of miRNA genomics. TRANSFAC [14] is a knowledge base containing published data on eukaryotic transcription factors, their experimentally proven binding sites, and regulated genes.
Various approaches have been developed for identifying functional RNA motifs or elements [15-26]. GeneSplicer [15] was developed for detecting splice sites in eukaryotic mRNA by combining several techniques, such as maximal dependence decomposition (MDD) and Markov models, that have already proven successful in characterizing the patterns around the donor and acceptor sites. polya_svm [16] was developed for predicting mRNA polyadenylation sites using a Support Vector Machine (SVM) featuring 15 over-represented cis-regulatory elements in the regions surrounding the site. RBSfinder [17] is a probabilistic method that improves the accuracy of gene identification systems at finding precise translation start sites. TransTermHP [18] rapidly and accurately detects rho-independent transcription terminators. CURE [19] was developed for predicting C-to-U RNA editing sites in plant mitochondria by incorporating both evolutionary and biochemical information. miRanda [20] was developed for finding genomic targets of miRNAs. RiboSW [21] is a systematic method for identifying 12 kinds of riboswitches based on conserved functional RNA sequences and conformations. PatSearch [22] was developed for searching for specific combinations of oligonucleotide consensus sequences, secondary structure motifs and position-weight matrices (PWMs). ERPIN [23] is a practical approach for automatically deriving an RNA signature from a sequence alignment and secondary structure, and for finding its occurrences in sequence databases. Several profiles have been constructed on the ERPIN web server to search any input sequence for the presence of certain RNA genes and elements. INFERNAL [24] is an implementation of a general stochastic context-free grammar (SCFG)-based approach for RNA database searches and multiple alignment. It is used to annotate RNAs in genomes in conjunction with the Rfam families by means of covariance models (CMs), a special case of SCFGs designed for modeling RNA consensus sequence and structure. MATCH [25] is an approach for searching for transcription factor binding sites with specific PWMs. RNAMotif [26] is an RNA secondary structure definition and search algorithm, and is commonly used for searching for user-defined RNA motifs.
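To make the idea of searching with a position-weight matrix concrete, the following is a minimal sketch of a generic log-odds PWM scan. It is not the scoring scheme of MATCH or PatSearch; the count matrix, the pseudocount, the uniform background and the threshold are all invented for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The counts are invented for illustration and do not come from TRANSFAC.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "U": 1},
    {"A": 0, "C": 0, "G": 1, "U": 9},
    {"A": 1, "C": 0, "G": 9, "U": 0},
    {"A": 9, "C": 0, "G": 0, "U": 1},
]


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (an assumption)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGU"
        })
    return pwm


def scan(seq, pwm, threshold):
    """Return (1-based start, matched window, score) for windows scoring >= threshold."""
    seq = seq.upper().replace("T", "U")
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGU" for base in window):
            continue  # skip windows containing ambiguous symbols
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start + 1, window, score))
    return hits


if __name__ == "__main__":
    pwm = log_odds(COUNTS)
    for start, window, score in scan("CCAUGACC", pwm, threshold=3.0):
        print(start, window, round(score, 2))
```

Raising the threshold trades sensitivity for specificity; real tools typically calibrate such cut-offs per matrix rather than using a single fixed value.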
Analysis of functional RNA motifs and sites in RNA sequences can yield useful information for deciphering RNA regulatory mechanisms. Our previous work, RegRNA [27], is widely used to identify regulatory motifs and miRNA target sites, and has been cited 50 times. However, new types of functional RNA motifs and new identification approaches have continued to accumulate in recent years. To identify functional RNA motifs comprehensively, a more complete and up-to-date analysis platform is crucial.
This work presents an integrated web server, RegRNA 2.0, for identifying functional RNA motifs and sites in an input RNA sequence. Numerous data sources, such as Rfam [4], fRNAdb [6] and UTRsite [1], and identification approaches, such as GeneSplicer [15], RiboSW [21] and RBSfinder [17], are integrated in RegRNA 2.0, and additional information, such as GC-content ratio and RNA accessibility, is also presented on the web page. Users can submit an RNA sequence through the user-friendly interface and obtain the prediction results with graphical visualization.

Development of identification procedures
Numerous analytical approaches and data sources are integrated in RegRNA 2.0 (Table 1). GeneSplicer [15], polya_svm [16], RBSfinder [17], TransTermHP [18], CURE [19], RiboSW [21], and ERPIN [23] are incorporated for identifying splicing sites, polyadenylation sites, ribosome binding sites, rho-independent terminators, C-to-U editing sites, riboswitches, and RNA elements, respectively. MATCH [25] is used with matrices collected in TRANSFAC [14] to search for a variety of transcription factor binding sites. PatSearch [22] and UTRsite models are integrated for identifying UTR motifs. INFERNAL [24] and Rfam CMs are integrated for identifying cis-regulatory families. miRanda [20] and the miRNA sequences of miRBase are integrated for identifying miRNA target sites. BLAST [29] and the sequences of fRNAdb are integrated for finding similar functional RNA sequences. The einverted program of the EMBOSS package [30] is used for identifying long stems, which might be involved in gene regulatory mechanisms [31-33]. For identifying putative RNA-RNA interaction sites, BLAST is used to find subsequences of the input sequence that are complementary to sequences in the NONCODE database, and RNAcofold of the Vienna RNA Package [34] is used to compute the free energy of the hybridization sites. RNAMotif [26] is integrated for searching for user-defined RNA motifs. In addition, RegRNA 2.0 is capable of predicting ORFs in the input RNA sequence. By default, an ORF must encode a protein of at least 80 amino acids, begin with a start codon (AUG, GUG or UUG) and end with a stop codon (UAA, UAG or UGA). Fully overlapped ORFs are not shown. Other related information, such as GC-content ratio and RNA accessibility, is also provided for the input RNA sequence. RNAplfold and RNAfold of the Vienna RNA Package [34] are used for predicting RNA accessibility and RNA secondary structure, respectively.
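The default ORF criteria described above can be sketched as follows. This is an illustration under stated assumptions, not the RegRNA 2.0 implementation: it scans only the three forward reading frames, counts the start codon but not the stop codon toward the 80-residue minimum, and interprets "fully overlapped" as "entirely contained within another reported ORF". The function names are hypothetical.

```python
START_CODONS = {"AUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(seq, min_aa=80):
    """Return 1-based inclusive (start, end) spans of forward-frame ORFs.

    Assumption: for each stop codon, the ORF begins at the first in-frame
    start codon after the previous stop, and the stop codon is included
    in the reported span but not in the amino-acid count.
    """
    seq = seq.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # codons before the stop
                if n_aa >= min_aa:
                    orfs.append((start + 1, i + 3))
                start = None
    return orfs


def drop_contained(orfs):
    """Remove ORFs lying entirely inside another ORF (our reading of 'fully overlapped')."""
    kept = []
    for span in sorted(orfs, key=lambda o: (o[0], -o[1])):
        if not any(k[0] <= span[0] and span[1] <= k[1] for k in kept):
            kept.append(span)
    return kept


def gc_content(seq):
    """Fraction of G and C symbols in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


if __name__ == "__main__":
    # Synthetic sequence: 2-nt leader, AUG, 80 alanine codons, UAA, 2-nt trailer.
    toy = "CC" + "AUG" + "GCU" * 80 + "UAA" + "CC"
    print(drop_contained(find_orfs(toy)))
    print(round(gc_content(toy), 3))
```

Starting each ORF at the first in-frame start codon reports the longest candidate per stop codon; a tool that also scans the reverse complement would need three further frames.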

User interface
An integrated web-based system with a user-friendly interface (Figure 2) was developed to help users conveniently and comprehensively identify functional RNA motifs and sites in an RNA sequence. Users can submit a sequence by entering a single sequence in FASTA format or by uploading a sequence file (Figure 2a), and the prediction results are presented via a graphical interface. Users can choose which types of functional RNA motifs to investigate by ticking the corresponding checkboxes (Figure 2b). All parameters are set to default values, and users can alter the thresholds to fit their requirements. For instance, when predicting miRNA target sites, users can select the species and adjust the minimum free energy (MFE) threshold and score threshold to filter the miRNA targets of interest. RegRNA 2.0 provides an intuitive graphical visualization (map view, Figure 2c) and a summarized information table (table view, Figure 2d) for the prediction results. The graphical location maps display the positions of the predicted motifs. The top-most graph shows the predicted ORFs, and the following graphs show the predicted functional RNA motifs or sites. Moving the cursor over a predicted motif brings up a pop-up description with a brief introduction, such as the name, the start/end positions and the binding factors (Figure 2e). Further analysis and additional information for a predicted motif, such as the predicted secondary structure and the corresponding RNALogo [35] graph, can be viewed by clicking on the motif of interest (Figure 2f). The details of the prediction results are available in the summarized information table (Figure 2g).
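The threshold-based filtering of miRNA target sites described above can be illustrated with a short sketch. The record layout, the field names, the example values and the cut-offs are all hypothetical; they are not taken from RegRNA 2.0 or miRanda output. The only domain assumption is that a more negative MFE indicates a more stable duplex, so a site passes when its MFE is at or below the chosen threshold.

```python
from dataclasses import dataclass


@dataclass
class TargetSite:
    """Hypothetical record for one predicted miRNA target site."""
    mirna: str
    start: int
    end: int
    score: float
    mfe: float  # kcal/mol; more negative means a more stable duplex


def filter_sites(sites, max_mfe, min_score):
    """Keep sites whose MFE is at or below max_mfe and whose score is at least min_score."""
    return [s for s in sites if s.mfe <= max_mfe and s.score >= min_score]


if __name__ == "__main__":
    # Invented example predictions; names and numbers are placeholders.
    predictions = [
        TargetSite("miR-example-1", 120, 141, score=165.0, mfe=-24.3),
        TargetSite("miR-example-2", 310, 332, score=150.0, mfe=-12.1),
        TargetSite("miR-example-3", 505, 526, score=132.0, mfe=-27.8),
    ]
    for site in filter_sites(predictions, max_mfe=-20.0, min_score=140.0):
        print(site.mirna, site.start, site.end)
```

Applying both criteria jointly mirrors the interface described above, where tightening either threshold narrows the list of reported targets.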

A case study of identification of purine riboswitch
The purine riboswitch is used as a case study to demonstrate the capability of RegRNA 2.0. Purine riboswitches, which are found in the 5'UTR of mRNAs, act as cis-acting genetic regulatory elements composed of a metabolite-responsive aptamer domain in a specific secondary structure. They can regulate both transcription and translation by binding their corresponding targets. Additional file 1 illustrates a cartoon representation of the mechanism of genetic regulation by the guanine riboswitch [36]. In the presence of high concentrations of guanine or hypoxanthine, ligand binding stabilizes the three-way junction structure, allowing the mRNA to form the terminator element (cyan). Without ligand binding, the 3' side of the P1 stem (green) and the 5' side of the terminator form an antiterminator element, allowing transcription to continue.
An RNA sequence with the EMBL accession number X83878 was used as input for RegRNA 2.0. According to the annotations of the Rfam and EMBL databases, X83878 contains a purine riboswitch and an operon of two B. subtilis genes, xpt and pubX. The total length of X83878 is 2413 bp, and the purine riboswitch is located from position 168 to 276. The CDS regions of xpt and pubX are located from position 357 to 941 and from position 938 to 2254, respectively. Figure 3 shows the RegRNA 2.0 prediction results for the case study. The locations of the two CDS regions were correctly predicted (Figure 3a), and the terminator of this operon was recognized by two RegRNA 2.0 identification procedures, TransTermHP and ERPIN (Figure 3b). The location of the purine riboswitch was identified by three RegRNA 2.0 identification procedures, RiboSW, INFERNAL and BLAST against fRNAdb (Figure 3c). A terminator, an RNA secondary structure that is crucial for the purine riboswitch to regulate gene activity, was also predicted (Figure 3d); this terminator closely follows the predicted purine riboswitch, which is consistent with the mechanism of genetic regulation by the purine riboswitch [36]. In addition, the MFE secondary structure of the predicted purine riboswitch region shows a conformation similar to the RNALogo graph of the Rfam purine family (Figure 3e). The results of the case study show that RegRNA 2.0 is capable of identifying and displaying useful information in a given RNA sequence, and is helpful for observing and deciphering RNA regulatory mechanisms.
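As a quick consistency check on the coordinates quoted above, the sketch below computes feature lengths, codon counts and the overlap between the two CDS regions. It uses only the positions stated in the text, treated as 1-based inclusive; the dictionary keys and helper names are illustrative.

```python
SEQUENCE_LENGTH = 2413  # total length of X83878 as stated in the text

# 1-based inclusive coordinates quoted in the case study.
FEATURES = {
    "purine riboswitch": (168, 276),
    "xpt CDS": (357, 941),
    "pubX CDS": (938, 2254),
}


def span_length(span):
    """Length in nucleotides of a 1-based inclusive span."""
    return span[1] - span[0] + 1


def overlap(a, b):
    """Number of positions shared by two 1-based inclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


if __name__ == "__main__":
    for name, span in FEATURES.items():
        n = span_length(span)
        note = f", {n // 3} codons" if name.endswith("CDS") and n % 3 == 0 else ""
        print(f"{name}: {n} nt{note}")
    print("xpt/pubX overlap:", overlap(FEATURES["xpt CDS"], FEATURES["pubX CDS"]), "nt")
```

Both CDS lengths are multiples of three (585 nt and 1317 nt, including the stop codon), the two CDS regions share 4 nt (positions 938 to 941), and the riboswitch (109 nt) ends 80 nt upstream of the xpt start, which fits its description as a 5'UTR element.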

Discussions and conclusions
RegRNA 2.0 helps users identify functional RNA motifs and sites in an RNA sequence. Compared with our previous work, RegRNA [27], RegRNA 2.0 incorporates more data sources and analytical approaches (Table 2). RegRNA 2.0 enables users to identify more types of functional RNA motifs and sites, including polyadenylation sites, ribosome binding sites, rho-independent terminators, AU-rich elements, RNA editing sites, RNA cis-regulatory elements, similar functional RNA sequences, non-coding RNA hybridization sites, long stems, open reading frames and related information about an RNA sequence. Additionally, RegRNA 2.0 provides further analysis of the prediction results, such as RNA secondary structure, RNA accessibility and RNALogo graphs, and displays the results with intuitive graphical visualization.
RegRNA 2.0 is an easy-to-use web server for comprehensively identifying regulatory RNA motifs and functional sites. It extends the widely used analysis platform, RegRNA [27], by taking more types of motifs and analytical approaches into consideration. RegRNA 2.0 lets users run the integrated programs without having to download and set up the code themselves. Through its integrated, user-friendly interface, users can apply various analytical approaches and conveniently inspect the results with graphical visualization. The platform will be enhanced in the future by supporting the input of multiple RNA sequences and by providing conservation analysis.

Figure 3 The RegRNA 2.0 prediction results for an input sequence, X83878, which is annotated with an operon of two genes in EMBL and a purine riboswitch in Rfam. (a) Predicted ORFs; (b) terminator; (c) purine riboswitch; (d) terminator; (e) predicted structure and RNALogo of the purine family.