A comparison of RNA folding measures

Background In the last few decades there has been a great deal of discussion concerning whether or not noncoding RNA sequences (ncRNAs) fold in a more well-defined manner than random sequences. In this paper, we investigate several existing measures for how well an RNA sequence folds, and compare the behaviour of these measures over a large range of Rfam ncRNA families. Such measures can be useful in, for example, identifying novel ncRNAs, and indicating the presence of alternate RNA foldings. Results Our analysis shows that ncRNAs, but not mRNAs, in general have lower minimal free energy (MFE) than random sequences with the same dinucleotide frequency. Moreover, even when the MFE is significant, many ncRNAs appear to not have a unique fold, but rather several alternative folds, at least when folded in silico. Furthermore, we find that the six investigated measures are correlated to varying degrees. Conclusion Due to the correlations between the different measures we find that it is sufficient to use only two of them in RNA folding studies, one to test if the sequence in question has lower energy than a random sequence with the same dinucleotide frequency (the Z-score) and the other to see if the sequence has a unique fold (the average base-pair distance, D).


Background
Noncoding RNAs (ncRNAs) are sequences that are transcribed from DNA that function as RNA rather than being translated to protein. Many of the known ncRNAs, such as transfer RNA (tRNA), ribosomal RNA (rRNA), spliceosomal RNA (snRNA), and microRNAs (miRNA), have key functions in the cell. Moreover, various new families of ncRNAs are emerging, and, as indicated in recent studies in mouse [1] and 10 human chromosomes [2], many more transcripts are for ncRNAs than was previously expected. In the late 1980's Maizel and co-workers proposed the use of thermodynamic stability to identify non-coding RNAs in sequence data [3][4][5]. Since then, there has been a great deal of discussion concerning whether or not ncRNA sequences support secondary structure features that are significantly different from those of random sequences. In particular, following some contradictory results concerning the stability of messenger RNAs (mRNA) presented in [6][7][8], in [9] it was concluded that ncRNAs have more stable structures than random sequences, but that the difference is not significant enough to be of use in identifying novel RNAs in sequence data on its own (see also [10]). Even so, more recent findings suggest that thermodynamic stability can be used to identify novel members of special families of RNAs [11], and that stability coupled with comparative genomics data is a useful tool for identifying ncRNAs in general [12].
To shed more light on the above findings, we present a large scale investigation for how well ncRNA sequences fold compared with random sequences. In particular, we investigate six measures for how well an RNA sequence folds (normalised energy (dG), Z-score (Z) and p-value (p) of minimal free energy (MFE), Shannon entropy (Q), average base pair distance (D), and valley index (VI), for definitions see the Methods section), and compare the behaviour of these measures over a large range of Rfam ncRNA families (see Table 1), including many of the families that appeared in the studies mentioned above.

Data sets
All data sets, except the protein control and the ribosomal RNA data sets, were obtained from Rfam 6.1 [13]. Rfam seed alignments were used to select a collection of RNA families, which are specified in Table 1. The rRNA data set consists of a large representative subset of the eukaryotic SSU rRNA sequences in the European rRNA database (see additional file 1 for further details).
For each class of RNA we obtained an alignment of sequences, which we filtered so that it had no more than 80% sequence identity. This was done using the program weight, that is part of the Sean Eddy "squid" utilities (downloaded 2004 from http://selab.wustl.edu/cgi-bin/ selab.pl?mode=software#squid).
In addition to the 13 data sets specified in Table 1, two control data sets were included; a protein control data set, consisting of 32 small protein coding sequences, and a set of shuffled RNA sequences (see additional file 1 for further details). The shuffled data set consists of 10 sequences from each of the 13 RNA data sets that were permuted, preserving dinucleotide frequencies [14], resulting in 130 sequences.

RNA folding statistics
Several quantities have been proposed for predicting how well an RNA molecule folds. In this paper we consider the following: The normalised minimal free energy (MFE) per base-pair (dG), the Z-score (Z), the p-value (p), the Shannon entropy (Q), the average base-pair distance (D), and the valley index (VI). We now present formal definitions for each of these measures. Let x = x 1 ʜ x L , denote an RNA sequence of length L, so that x i is either A, C, G or U for each 1 ≤ i ≤ L.
The normalised energy, dG, is arrived at from a free energy minimisation procedure. It is defined as Table 1: The data sets used in this study. The first column contains a short name describing the data set, that is later used in text and figures. The RNA families/data sets can contain several types of sequences, such as the RNase family that contains both RNase P and RNase MRP. The different sequence types, or family members, are given in column two. In column three and four the number of family members (N FM ) and the total number of sequences (N S ) are given, respectively. The last two columns in the table give the mean  and standard deviation of the sequence length and %GC- where E(x) is the minimal free energy (MFE) for sequence x, as computed using RNAfold [15]. This program implements the folding algorithm presented in [16].
The Z-score and the p-value compare the MFE of the sequence x to the MFEs of permuted versions of x having identical dinucleotide composition. These compositions are preserved due to the importance of stacked base-pairs in the calculation of MFE [8]. For each sequence in this study, 500 shuffled sequences were generated using a mono-and dinucleotide frequency preserving procedure implemented in the program shuffle that is part of the Sean Eddy "squid" utilities.
The Z-score [17] is the number of standard deviations by which the MFE of x deviates from the mean MFE of the set of shuffled sequences [6,8,9,17]. It is defined as where <·> and σ(·) denote the mean and the standard deviation of the MFEs of the sequences in .
The p-value of x is the fraction of sequences in having MFE lower than x or, expressed differently, the area under the distribution function to the left of the MFE of x. It is defined as where M is the number of sequences in with MFE lower than the MFE of x, and N is the number of shuffled sequences, | |.
In vivo, RNAs commonly exist in an ensemble of structures. The distribution of these structures can be modelled by a Boltzmann distribution. Using this setup, it is possible to efficiently compute the partition function, Z, for the ensemble of secondary structures corresponding to an RNA sequence x [18]. In particular, the probability of a structure S α (which we regard as a set of basepairs) is given by , where , E α is the free energy of S α , R = 8.31451 Jmol -1 K -1 is the molar gas constant, and T is the temperature, which we take as 310.15 K (37°C). The basepair probability p ij (the probability that x i pairs with x j ) is then given by is 1 if x i and x j is a base-pair in S α , and 0 otherwise.
We use the implementation of McCaskill's algorithm in RNAfold to compute base-pair probabilities. The normalised Shannon entropy of x [19] is then defined as We can also use base-pair probabilities to compute the average base pair distance between all structures in , <d BP > as follows (I.Hofacker, pers. commun.). (Version 1.5beta of RNAfold output this measure as "ensemble diversity".) The base-pair distance, d BP (S α , S β ) between two structures S α and S β on x is defined as the number of base-pairs not shared by the structures S α and S β (see e.g. [20]). Hence, if |S α | is the number of base-pairs in S α , i.e.
, where i and j lie between 1 and L, then the base-pair distance between structures S α and S β equals In particular,

Since
, <d BP > can thus be rewritten as Thus normalising by length, the average base-pair distance is given by The last measure that we consider in this study is the valley index (VI) [21]. It can be regarded as an approximation to D (see below), and is meant to measure the number of "valleys" in the RNA folding landscape of x.
Formally it is defined as follows: List the suboptimal structures of x according to their free energies so that S opt , an MFE structure for x, is first and S 1 ,..., S n are the next n struc- where is the Boltzmann factor, and Note that our definition of VI differs slightly from the Kitagawa et al.'s definition since we use normalised base-pair distance, d BPnorm , rather than the coarse-grained tree metric in their study. The suboptimal structures S 1 ,..., S n are randomly sampled with probabilities equal to their Boltzmann weights using the program RNAsubopt [22]. We sample 300 structures resulting in between 16 (regulatory) and 300 (telomerase) unique structures.
In principle, the valley index for an RNA with a low number of valleys in the folding landscape should be low, whereas an RNA with a multi-valley folding landscape should have a correspondingly higher index. Note that the sums in the definition on VI are taken over all structures in a set of suboptimal structures within a certain energy distance from the MFE. If the energy distance is increased this set of structures will eventually include all the sequences in the ensemble . In this situation, in view of the definition of w(α) it follows that the valley index of x can be rewritten as

Comparison of measures
The six measures that we investigated are correlated to varying degrees; see Table 2 and Figure 1. The measures Q and D are highly correlated (correlation coefficient = 0.98), which could be due to the fact that they are both computed using McCaskill base pair probabilities, p ij . Also, as expected, the Z-score and p-value are strongly correlated, but not in a linear fashion (see Figure 1). We see that the Z-scores are more sensitive for low values than the p-values (e.g. all Z-scores below -3 correspond to a p-value of 0.0), and so Z-scores are more informative.
The statistic dG is weakly correlated to all other measures. However, it is interesting to note that dG is negatively correlated to %-GC. This is to be expected since GC base pairs have lower energy than the other possible base-pairings. The miRNA family is an exception to this rule, since it has low dG values, but an average %-GC of about 50%, see Figure 1. Table 2 shows that the correlation between VI and the other measures is low over all families. However, Figure 1 indicates that for a subset of all the sequences the correlation between VI and Q or D is very strong. This is also confirmed by computing the correlation coefficients for the 15 RNA families separately. miRNA, SRP, tRNA, telomerase, and Hh1 show strong correlations (> 0.65) between VI and Q or D, whereas the corresponding correlations for rRNA, snRNA, riboswitch, regulatory, and snoRNA are weak (< 0.3).

Comparison between RNA families
In general, we deem an RNA sequence to have a stable secondary structure if the measures dG, Z, and p are significantly lower than the corresponding values for the shuffled control data sets. To check whether this was the case for the different data sets, we applied a Mann-Whitney rank sum test [23]. This test compares two data sets and computes the probability that the two data sets are sampled from the same distribution. Unlike the t-test, the Mann-Whitney test is distribution free since it compares the ranks of the data values instead of the data values themselves.
At a significance level of 99% the Mann-Whitney test indicated that Z and p are higher for the shuffled data set than for any of the real RNA data sets, except for mRNA and Hh1. The same held for the normalised energy dG, except  for the tmRNA, tRNA, regulatory and snoRNA families. This result agrees with those observed in [10], that ncRNAs have significantly lower Z-score than unstructured sequences. This can also be seen in Figure 2.

Correlations between measures
The measures Q and D can be used to indicate whether a sequence folds into a unique secondary structure or into several alternative structures [24]. The riboswitch data set consists of sequences known to have alternative structures, and so we expected the values of Q and D to be rather high for this data set. We did find this to be the case, but surprisingly they were also as high or even higher for other data sets (see Figure 2).
The high values of Q and D obtained for the mRNA and shuffled data sets is probably due to the fact that these RNAs are unstructured, and hence there are many alternative possible structures. This could also explain the values of Q and D for tmRNA, since tmRNAs are to a large extent mRNA-like (large parts of such molecules are unstructured). Other RNA families like tRNA and RNAse have tertiary interactions that aren't included in secondary structure, which explains their relatively high Q-and Dvalues. The interaction of rRNAs and snoRNAs with proteins and other RNAs most likely stabilise their native structures, even though alternative structures are possible.
The values of our measures for the telomerase sequences were unexpected. Telomerase has low energy per base, yet it has a rather high Z-score compared to the other ncRNAs. The high stability of this molecule is most likely due to an unusual sequence composition; the telomerase sequences have a high %-GC level, 65% (see Figure 3). The high values of Q and D suggest that the telomerase sequences have alternative structures.
The miRNAs have very stable structures, indicated by low Z and dG, especially in view of their %GC level (~50%). This has previously been observed in [11]. The miRNAs also have low values of Q, D, and VI, indicating a unique structure.

Comparison with previous studies
Seffens and Digby [6] examined 51 mRNA sequences and observed that they have lower folding energy than shuffled versions of the sequences preserving mono-but not dinucleotide frequencies. Shortly after, Workman and Krogh examined 46 of the 51 mRNAs and showed that they do not have lower folding energy than shuffled versions of the sequences, when the dinucleotide frequencies are preserved [8]. In our study, in which sequences were shuffled so as to preserve both mono-and dinucleotide frequencies, we confirm that mRNAs do not have lower folding energy than shuffled sequences. In [8] a small sample of rRNA and tRNA sequences were also investi-gated and it was indicated that rRNA, but not tRNA has lower folding energy than dinucleotide shuffled sequences. Our study, with significantly more data, agrees with their findings for rRNA, but differs for tRNAs, which we found to have significantly lower Z-scores than shuffled sequences. Rivas and Eddy [9] argue that secondary structure alone is generally not significant for the detection of ncRNA, but note that ncRNAs have slightly lower folding energies than shuffled sequences. Note that in [9] sequences are shuffled preserving mononucleotides only, whereas in our study we shuffled sequences preserving dinucleotide frequencies. Rivas and Eddy computed Zscores for a large set of tRNAs, and even though we adopt a different shuffling procedure, our results for tRNA are in good agreement with Rivas and Eddy's findings.
Kitagawa et al. [21] observed that five snRNAs have low folding energies compared to shuffled sequences. Our studies confirm this observation, and in general we found that snRNA sequences have lower folding energies than shuffled sequences with the same dinucleotide frequency. Kitagawa et al. also computed VI values for the same five snRNAs, and observed that the values varied considerably (indicating that some have uni-valley landscapes while other have multi-valley landscapes). Although we used a variant of VI, we also found that the VI value varies considerably for different snRNA sequences.
Bonnet et al. observed that miRNAs have considerably lower folding energy than dinucleotide shuffled sequences, unlike tRNA and rRNA [11]. Our studies confirm this observation, although Bonnet et al. investigated shorter regions of the rRNA, while we investigated full rRNA sequences.
In our study, we found the mean Z-scores (and p-values) to be significantly lower for ncRNAs (except the Hammerhead type I family) than for the shuffled sequences (although the Z-scores for mRNA were not lower). This is in agreement with recent results presented in [10], where it is shown that non-coding RNAs have lower Z-scores than coding RNAs for a selection of RNA families (tRNA, Hammerhead type III, a regulatory element (SECIS), SRP, snRNA (U1 and U2), mRNA (divided into coding sequence and 5'-and 3'-untranslated regions)).

Conclusion
We have studied six previously defined measures for predicting how well an RNA molecule is expected to fold (dG, Z, p, Q, D, and VI), and applied them to a large collection of RNAs from the Rfam database. We found all of these measures to be correlated to some degree. The measures Z and p are strongly correlated, but Z is more sensitive than p. Since dG is a measure of MFE it is strongly correlated to the nucleotide composition of the sequence, and so a low dG does not necessarily imply a stable structure. Hence, it is probably sufficient to use Z as opposed to p and dG. For the families that we used in this study, we found the mean Z-scores (and p-values) to be significantly lower for ncRNAs than for the shuffled sequences.
The three measures Q, D and VI can be regarded as measures of the ruggedness of the RNA folding landscape. Both Q and D are computed from the partition function and are thus strongly correlated, and so either of them is probably sufficient for measuring ruggedness. The valley index VI can be viewed as an approximation of the average basepair distance D (see Methods section), and so there is no advantage in computing VI, especially since it is slow to compute, whereas D can be computed efficiently. RNA families having high values of D (and Q) were either unstructured RNA sequences, long RNA sequences that fold with the help of proteins, or RNAs with alternative folds or pseudoknot structures.
Thus, in summary, we expect that rather than using all of dG, Z, p, Q, D, and VI to predict how well an RNA molecule folds, that it is sufficient to use only Z and D (or Q). Our studies suggest that a combination of Z-score and D value might be useful for identifying well-defined RNA structures, such as the miRNAs (in agreement with results presented in [11]), and, based on our results, we expect that variations of these measures (such as the alignment Box and whisker plots of length, %GC, and G/C ratio Figure 3 Box and whisker plots of length, %GC, and G/C ratio. Box and whisker plots displaying medians, quartiles and range of the sequence length, %GC, and G/C ratio for all our test data sets. The lines of the box are at the lower quartile, median, and upper quartile values. The box width is proportional to the number of sequences in the data set. The whisker lines extend from each end of the box to the most extreme data value or have a maximal length of 1.5 times the box height. Data points beyond the ends of the whiskers are marked by +.