Secondary structural entropy in RNA switch (Riboswitch) identification

Manzourolajdad, Amirhossein; Arnold, Jonathan

doi:10.1186/s12859-015-0523-2

Methodology Article
Open access
Published: 28 April 2015

Amirhossein Manzourolajdad^1,2 &
Jonathan Arnold^1,3

BMC Bioinformatics volume 16, Article number: 133 (2015)

Abstract

Background

RNA regulatory elements play a significant role in gene regulation. Riboswitches, a widespread group of regulatory RNAs, are vital components of many bacterial genomes. These regulatory elements generally function by forming a ligand-induced alternative fold that controls access to ribosome binding sites or other regulatory sites in RNA. Riboswitch-mediated mechanisms are ubiquitous across bacterial genomes. Each class of riboswitch typically has its own unique structural and biological complexity, making de novo riboswitch identification a formidable task. Traditionally, riboswitches have been identified through comparative genomics based on sequence and structural homology. The limitations of structural-homology-based approaches, coupled with the assumption that there is a great diversity of undiscovered riboswitches, suggest the need for alternative methods for riboswitch identification, possibly based on features intrinsic to their structure. As yet, no such reliable method has been proposed.

Results

We used the structural entropy of riboswitch sequences as a measure of their secondary structural dynamics. Entropy values of a diverse set of riboswitches were compared to those of their mutants, their dinucleotide shuffles, and their reverse complement sequences under different stochastic context-free grammar folding models. The significance of our results was evaluated by comparison to other approaches, such as base-pairing entropy and energy landscape dynamics. Classifiers based on structural entropy, optimized via sequence and structural features, were devised as riboswitch identifiers and tested on Bacillus subtilis, Escherichia coli, and Synechococcus elongatus as an exploration of structural-entropy-based approaches. The unusually long untranslated region of the cotH gene in Bacillus subtilis, as well as upstream regions of certain genes, such as the sucC genes, were associated with significant structural entropy values in genome-wide examinations.

Conclusions

Various tests show that there is in fact a relationship between higher structural entropy and the potential for the RNA sequence to have alternative structures, within the limitations of our methodology. This relationship, though modest, is consistent across various tests. Understanding the behavior of structural entropy as a fairly new feature for RNA conformational dynamics, however, may require extensive exploratory investigation both across RNA sequences and folding models.

Background

Non-protein-coding RNA (ncRNA) elements play an important role in biological pathways, such as gene regulation [1-4]. It has been shown that conformational features of many such RNA elements play a major part in their biological function [5,6]. In bacteria, RNA structural rearrangements can have a major effect on the expression of their downstream coding sequences (reviewed by [7]), a process known as cis-regulation. A classic example, and one of the earliest such elements discovered, is the complex regulatory mechanism that takes place upstream of the tryptophan operon in Escherichia coli during its expression [8]. Regulation of the tryptophan biosynthetic operon, however, is achieved through very different mechanisms in other organisms, such as B. subtilis and Lactobacillus lactis (reviewed by [9]). Because much attention was given to protein-coding genes in the past, the development of ncRNA gene finders has become a relatively new area of genomic research [10]. Currently, many general-purpose [11-13] as well as ncRNA-specific gene finders [14-16] are available.

Riboswitches

An interesting group of RNA regulatory elements are riboswitches. Riboswitches are defined as regulatory elements that take part in biological pathways by selectively binding to a specific ligand or metabolite, or uncharged tRNAs, without the need for protein factors. Environmental factors such as pH [17], ion concentration [18-20], and temperature [21,22] can also trigger RNA conformational changes affecting gene regulation. Nearly all riboswitches are located in the non-coding regions of messenger RNAs [23] and are capable of regulating genes through both activation and attenuation of either transcription or translation (reviewed by [24]). Finally, other factors such as the transcription speed of RNA polymerase, the folding and unfolding rates of the aptamer of the riboswitch, and the binding rates of the metabolites add other dimensions to categorizing riboswitches. These and other factors determine whether the RNA switching mechanism is kinetically or thermodynamically driven. In addition to thermodynamics-based approaches, RNA kinetics has been gaining momentum in riboswitch-mediated regulation studies at the system level. Lin and Thirumalai [25] introduce a kinetic feedback-loop network model that describes the functions of riboswitches using experimental data from the flavin mononucleotide (FMN) riboswitch.

Originally found through sequence homology upstream of bacterial coding regions [26-28], riboswitches have been shown to be more abundant than previously expected. They have also been found in cooperative or tandem arrangements [23]. It is speculated that there are at least 100 more undiscovered riboswitches in already sequenced bacterial genomes [23]. Conformational factors are essential to ligand-binding specificity of riboswitches. Many riboswitches can discriminate between similar small molecules with the aid of their structural geometry. For instance, the thiamine pyrophosphate (TPP) and S-adenosylmethionine (SAM) riboswitches measure the length of the ligand that binds to them [29-31].

RNA secondary structure

The secondary structural topology of the RNA is very effective in scaffolding the tertiary conformation. Secondary structure mainly consists of a two-dimensional schema that depicts the base-pairing interactions within the RNA structure and is dominated by Watson-Crick base-pairing. One major computational method to predict RNA secondary structure is minimization of its free energy (MFE) within a thermodynamic ensemble, such as the Boltzmann ensemble [32,33]. State-of-the-art thermodynamic models have proven to be effective in RNA secondary structural predictions in most cases. An example of where such predictions fail is the Hammerhead type I ribozyme, where loop tertiary interactions have a dominating effect on the structural conformation [34]. Centroids of the Boltzmann ensemble are also used for RNA secondary structural predictions [35]. In many cases, such a prediction is more similar to the structure inferred from comparative sequence analysis than the MFE structure is [35]. In addition, stochastic context-free grammars (SCFG) have been shown to be effective in secondary structural prediction of various RNA regulatory elements. Nawrocki and Eddy, 2013 [13] have shown that more sophisticated grammars, designed to mirror the thermodynamic models, can improve the prediction accuracy of structures once trained on known RNA structures based on maximum-likelihood criteria^a.

Most of the discovered prokaryotic RNA regulatory elements (including riboswitches) are located upstream of the genes they regulate. They act as cis-regulatory elements and exhibit strong secondary structural conservation. Some exceptions to cis-regulation are two trans-acting SAM riboswitches [36] and an antisense regulation of a vitamin B₁₂-binding riboswitch [37] in Listeria monocytogenes. Insights into the structural and functional complexity of riboswitches already discovered are offered in [38]. Purine riboswitches are good examples of secondary structural conservation. The add adenine riboswitch from V. vulnificus and the xpt guanine riboswitch from B. subtilis have very similar secondary and tertiary conformations, despite different crystal packing interactions, pH, and Mg crystallization conditions [39]. In fact, investigation of secondary-structural homology upstream of genomic regions containing the same genes has led to the discovery of more cis-regulatory elements in bacteria [40,41], making it the major current approach for riboswitch identification.

The fact that riboswitch discovery is mainly based on homology makes it difficult to assess how much secondary structural conservation is expected to be prevalent in undiscovered riboswitches. Furthermore, structural homology is not always successful in finding riboswitches. Despite the rigorous sequence and structural homology searches based on the SAM-I riboswitch in [42], the SAM-IV riboswitch could not be detected. The authors further hypothesized that the structural diversity of riboswitches could be far greater than what has been already observed. Serganov and Nudler, 2013 [38] suggest that there may not even be an interconnection between the structures of riboswitches and the nature of their cognate metabolites and, consequently, that the biochemical and structural information gathered so far may not be as useful in riboswitch validation as expected. The above limitations of homology-based riboswitch identification methods indicate the need for an alternative approach.

Conformational dynamics

While secondary-structure conformational features are very descriptive of many classes of riboswitches, their folding dynamics are also critical. A typical example is the TPP riboswitch which can fold into alternative structures depending on the presence of the TPP ligand. The tertiary structure stabilized in the presence of TPP is shown in Figure 1A [43]. Both the ligand-bound and the unbound secondary structures necessary for TPP riboswitch regulatory function are shown in Figure 1B. One of the major computational tools to explore possible folding trajectories is the free energy landscape. The free energy landscape was originally defined for protein folding [44]. In a typical RNA free energy landscape, possible conformations are shown with their corresponding free energy and pairwise distances from one another. In an effort to investigate the thermodynamic equilibrium of RNA folding, Quarta et al. [45] presented a case study of the energy landscape of the TPP riboswitch where the base-pairing distances between the structural possibilities form two major clusters. The clusters corresponded to native and ligand-bound structural conformations. After repeating this process for various choices of elongation of the TPP riboswitch, they showed that for certain ranges of length, each cluster corresponds to one of the two structures of the riboswitch (see Figure 1C).
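The base-pairing distance between two candidate structures, which underlies the landscape clustering described above, can be illustrated with a minimal sketch. It assumes structures are written in dot-bracket notation without pseudoknots and takes the distance to be the number of base pairs present in exactly one of the two structures, which is a common definition; the function names are hypothetical and this is not the implementation used in [45].

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Count base pairs that occur in exactly one of the two structures."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must describe the same sequence length")
    # Symmetric difference of the two base-pair sets.
    return len(base_pairs(structure_a) ^ base_pairs(structure_b))


if __name__ == "__main__":
    # Two toy folds of an 8-nucleotide sequence (hypothetical example).
    print(base_pair_distance("((...)).", ".((...))"))
```

A matrix of such pairwise distances over a sample of structures is the kind of input that a two-cluster analysis of the landscape would operate on.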

In [46], the dynamics of energy landscapes across elongation of various riboswitches were investigated and it was shown that such landscapes have different clustering dynamics across kinetically and thermodynamically driven riboswitches. This work highlights the fact that even in a kinetically driven regulation scenario, investigation of the dynamics of the thermodynamic equilibrium across the elongation can be informative. In a more recent work, energy landscape analyses led to strong evidence of evolutionary co-variation of base-pairs that favor a conserved alternative structure of the purine riboswitch [47]. In addition, prediction of structural switching in RNA has been addressed by [48,49] using abstract shapes to represent different secondary structural conformations. Freyhult et al. 2007 [50,51] examined the lowest free energy structural conformations having a certain base-pairing distance to the actual structure of the RNA to explore the structural neighbors of an intermediate, biologically active structure. A more recent work [52] presents an ingenious approach that significantly decreases the computational cost of estimating the likelihood of structural neighbors. However, to date there is no computational method that can identify the diverse and structurally complex riboswitches with high confidence.

Investigation into the folding dynamics of the nascent RNA based on free energy sampling and pair-wise distances can be computationally costly. Finding a sample size that sufficiently reflects the RNA folding space behavior can be difficult and prone to model parameter biases. Furthermore, even if optimized parameters and sufficient samples were available, it would still be difficult to make comparisons across RNA elements. The latter is mainly due to the fact that the characteristics of such folding distributions (here, free energy vs. structural distance within a given ensemble of secondary structures) are not well understood.

One statistic to evaluate the distribution characteristics of any probabilistic model is the Shannon entropy [53]. While the conformation with maximum likelihood under a given SCFG is referred to as the optimum structure under that model, all of the other sub-optimal conformations can also be associated with a probability. Hence, the Shannon entropy (the negative expected log-likelihood) of such a probabilistic folding space is \(H(S)=-\sum _{s\in S} p(s)\log {p(s)}\), where S is the folding space containing all possible secondary structures s valid on the desired RNA sequence, each of which is associated with its probability of occurrence p(s). Here, the notion of probability can also be interpreted as the frequency of occurrence of a particular conformation for the RNA sequence. Alternative formulations and approximations of Shannon entropy exist in RNA secondary structure studies, such as [54]. Exact calculation of Shannon entropy under a given SCFG as a probabilistic secondary structural folding model, however, was done in [55] and shown to be computationally convenient, being achievable in polynomial time \(O(n^{3})\), where n is the length of the RNA sequence. In an independent work, [56] also offered an algorithm to calculate the Shannon entropy of the stochastic context-free grammar BJK [57] with parameter sets derived from a given alignment. Other measures of structural diversity, such as the ensemble diversity computed by RNAfold -p in the Vienna RNA Software Package [58], also exist. In this work, structural diversity is measured by the exact RNA secondary structural information-theoretic uncertainty (or here, Shannon entropy) of the complete SCFG-modeled folding space of the RNA, as computed by [55]. From here on, we refer to this measure as structural entropy. We investigated the significance of structural entropy of RNAs with more than one biologically functional secondary structural conformation.
A diverse set of prokaryotic RNA elements, validated to have such potential, was used for this purpose. The performance of structural entropy in distinguishing riboswitches was compared to other similar features under different negative-control sets. We then made an attempt to develop a computational method for riboswitch identification via structural entropy on a genome-wide level. The goal of the presented genome-wide tests, however, is mainly exploratory and aims to investigate the genomic regions or elements to which the developed method is highly sensitive.
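To make the definition of \(H(S)\) concrete, the following minimal sketch evaluates the Shannon entropy of an explicitly listed folding space. It is only an illustration: the exact method of [55] works on the full SCFG-modeled folding space in polynomial time and does not enumerate structures. The function name, the choice of base 2 for the logarithm, and the toy probabilities are assumptions made for this example.

```python
import math


def structural_entropy(probabilities, base=2.0):
    """Shannon entropy H = -sum p(s) log p(s) over an enumerated folding space.

    `probabilities` holds p(s) for every structure s and must sum to 1.
    The logarithm base (2 here, giving bits) is an assumption of this sketch.
    """
    total = sum(probabilities)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p(s) = 0 contribute nothing, by the usual convention.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0.0)


if __name__ == "__main__":
    # Hypothetical folding spaces with four structures each.
    one_dominant_fold = [0.97, 0.01, 0.01, 0.01]
    two_competing_folds = [0.48, 0.48, 0.02, 0.02]
    print(structural_entropy(one_dominant_fold))
    print(structural_entropy(two_competing_folds))
```

In this toy setting, a folding space dominated by a single structure has lower entropy than one in which two structures compete, which is the intuition behind using entropy as a proxy for alternative folds.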

It has been previously shown that both high and low structural entropy values of certain classes of ncRNAs can be potentially significant. For instance, for certain riboswitches, GC-composition was co-associated with significantly high structural entropy, regardless of model accuracy to RNA secondary structure [55]. This observation raised the possibility that RNAs under selective pressure to have alternative folds may have higher (not lower) structural entropy than expected. As discussed previously in [55], this seemingly counterintuitive observation is not theoretically impossible. This observation lies at the center of the proposed methodology, as will be shown.

Our approach

Folding models

The folding model under which the structural entropy of the RNA is computed is critical. SCFG folding models can be very lightweight and consist of only a few grammar rules and parameters, or they can be very sophisticated, consisting of thousands of parameters [13,59]. In [55], it was shown that the structural entropy value is very model sensitive. On the other hand, parameters of SCFG models are usually set by maximizing their prediction accuracy using maximum-likelihood approaches. There is no guarantee, however, that folding models optimized for such criteria also preserve information about the folding dynamics of such RNAs. Increasing the accuracy of folding models under current approaches may be done at the expense of altering the folding space of possible structures under that model, thus losing the information about folding dynamics of the RNA. In order to avoid potential biases in our preliminary examination, it was essential that we include models not trained to best predict secondary structure in addition to models that are. Two different SCFG models were chosen for this study, one being a structurally unambiguous SCFG model with parameters trained to best predict RNA secondary structure, and one being a structurally ambiguous model with symmetric rules and probabilities. The theoretical implications of structural ambiguity fall outside the scope of this work and the interested reader can refer to [55]. Here, we merely treat them as two different folding models.

Gathering data

There is a significant amount of sequence and/or structural similarity within each class of riboswitch. This is due to the fact that these riboswitches have been discovered using sequence and/or structural homology. Here, however, we are interested in capturing the universal characteristics of RNAs with alternative fold(s), mainly riboswitches, as a basis for an identification method for conformational switches. In order for our method to be less biased towards a specific structural conformation, we avoided using homologous RNA sequences or sequences that belong to closely related organisms, where possible. We also resorted to only evaluating riboswitches that have been experimentally validated to be functional rather than computationally discovered ones. The data set gathered in this work is a compromise between the above considerations and the need to include a diverse set of riboswitches in our data set. Although the attempt to computationally extract a universal feature from the diversity of prokaryotic riboswitches each having unique structural and biological characteristics is a great oversimplification, it serves as a common ground for comparing various features that aim to capture the RNA conformational dynamics as a whole.

Negative controls

One of the main challenges of our test was the preparation of a reliable negative control. The folding models deployed here are very lightweight and simplistic, giving rise to potential unrelated dependencies on factors such as the genomic composition of the RNA sequence. Therefore, gathering real biological sequences that are as similar to RNA sequences as possible while not having potential for alternative fold(s) is critical to the significance of our test. Here, we relied on the following sets of negative controls: (1) dinucleotide shuffles of riboswitches (generated using [60]); (2) mutagenesis, that is, structural mutants of the gathered sequences experimentally shown not to be functional; (3) the reverse complements (or antisense sequences) of the gathered riboswitches; and (4) sequences of the non-coding regions that are likely to be riboswitches. The choice of antisense as a negative control is explained in the Methods section.
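As a small illustration of the third negative control, the reverse complement (antisense) of an RNA sequence can be generated as follows. This is a minimal sketch, assuming sequences use the RNA alphabet A, C, G, U; the function name and the example sequence are hypothetical. Dinucleotide shuffling is more involved and is not sketched here.

```python
# Translation table for Watson-Crick complements in the RNA alphabet.
_COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")


def reverse_complement(rna):
    """Return the antisense strand of an RNA sequence, read 5' to 3'.

    Characters outside A, C, G, U (for example N) are left unchanged.
    """
    return rna.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    sense = "GGGAAAUCC"  # hypothetical toy sequence
    print(reverse_complement(sense))
```

The antisense strand has the same length as the sense strand and exchanges the counts of A with U and of G with C, which is part of what makes it a convenient composition-matched control.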

Comparison to other methods

Two additional measures of structural diversity were used to assess the significance of structural entropy values in the collected data. The first measure was the base-pairing entropy [54] of the BJK model, denoted BJKbp, as defined in ([61] Eq. 3). For more information see the Methods section. The second measure, denoted Sil, was obtained from clustering the RNA energy landscape. The Sil value reflects how well the energy landscape clusters into two. Calculations for Sil were done according to [46]. We then compared the performance of classifiers designed to distinguish riboswitches from various negative controls. In order to evaluate the performance of structural entropy in detecting alternative folds, we compared it to measures from RNAShapes [49] and FFTbor [52] predictions. These measures corresponded to the energy disequilibrium of alternative folds, p1/p2, where p1 was the highest value in the predictions of the corresponding software and p2 was the second highest value. For RNAShapes, p1 is the probability of the most likely abstract shape of the structure, whereas p2 is the probability of the second most likely shape. For FFTbor, p1 is the probability of the MFE structure and p2 is the probability of an alternative folding scenario where the structure has a particular base-pair distance from the MFE structure. Features used in this work are shown in Table 1. Please see the Methods section for further details.
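The p1/p2 ratio described above can be sketched in a few lines. The sketch assumes the probabilities reported by a tool have already been parsed into a list of floats; the function name and the example values are hypothetical, and no output format of RNAShapes or FFTbor is assumed.

```python
def top_two_ratio(probabilities):
    """Return p1/p2, the ratio of the highest to the second-highest value."""
    if len(probabilities) < 2:
        raise ValueError("at least two probabilities are required")
    p1, p2 = sorted(probabilities, reverse=True)[:2]
    if p2 == 0:
        # No competing alternative at all; treated here as an infinite ratio.
        return float("inf")
    return p1 / p2


if __name__ == "__main__":
    # Hypothetical shape probabilities for one sequence.
    print(top_two_ratio([0.55, 0.40, 0.05]))
```

A ratio close to 1 indicates two nearly equally likely alternatives, whereas a large ratio indicates a single dominant prediction.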

Table 1 List of various sequence and structural features used throughout the work
