# Quantifying variances in comparative RNA secondary structure prediction

- James WJ Anderson
^{1}Email author, - Ádám Novák
^{1}, - Zsuzsanna Sükösd
^{2}, - Michael Golden
^{3}, - Preeti Arunapuram
^{4}, - Ingolfur Edvardsson
^{5}and - Jotun Hein
^{1}

**14**:149

https://doi.org/10.1186/1471-2105-14-149

© Anderson et al.; licensee BioMed Central Ltd. 2013

**Received: **30 November 2012

**Accepted: **21 March 2013

**Published: **1 May 2013

## Abstract

### Background

With the advancement of next-generation sequencing and transcriptomics technologies, regulatory effects involving RNA, in particular RNA structural changes are being detected. These results often rely on RNA secondary structure predictions. However, current approaches to RNA secondary structure modelling produce predictions with a high variance in predictive accuracy, and we have little quantifiable knowledge about the reasons for these variances.

### Results

In this paper we explore a number of factors which can contribute to poor RNA secondary structure prediction quality. We establish a quantified relationship between alignment quality and loss of accuracy. Furthermore, we define two new measures to quantify uncertainty in alignment-based structure predictions. One of the measures improves on the “reliability score” reported by PPfold, and considers alignment uncertainty as well as base-pair probabilities. The other measure considers the information entropy for SCFGs over a space of input alignments.

### Conclusions

Our predictive accuracy improves on the PPfold reliability score. We can successfully characterize many of the underlying reasons for and variances in poor prediction. However, there is still variability unaccounted for, which we therefore suggest comes from the RNA secondary structure predictive model itself.

## Keywords

## Background

RNA secondary structure prediction is still an important problem in computational biology. With the advent of next generation sequencing and RNA-seq technologies, many RNA structural changes are being found to play important roles in regulating gene expression [1, 2]. Gene regulation studies can now be done on a genome-wide scale. In some cases RNA secondary structures can be experimentally determined on a genome-wide level [3], but these methods require RNA isolation and many not preserve *in vivo* structures. RNA secondary structure prediction programs are still often used to predict structures across the genome [4]. The predicted secondary structures, and predicted structural changes, are being used to find relationships and suggest mechanisms in gene regulatory networks.

Some methods for RNA secondary structure prediction only consider a single sequence as input. However, prediction quality can be improved by using multiple sequences, assuming that RNA secondary structure is conserved through evolution. Even without a complex evolutionary model, these additional structural constraints provide valuable information on folding. Comparative methods for RNA secondary structure prediction are based on this observation, and use evolutionary information from multiple alignments to improve prediction quality.

Methods for RNA secondary structure prediction generally fall into two categories. Thermodynamic models make use of free-energy functions, which take experimentally determined energy parameters for individual structural elements. Dynamic programming is then used to find the secondary structure with the minimum free energy, which is reported as the predicted structure. This has been successfully implemented in programs such as RNAfold [5] and UNAfold [6]. Thermodynamic methods typically deal with the single-sequence prediction problem, but extensions such as RNAalifold [7] and PETfold [8] allow for comparative prediction.

Stochastic context-free grammars (SCFGs), on the other hand, define a probability distribution over the space of RNA secondary structures. Posterior decoding techniques are typically used to determine, for example, the maximum expected accuracy structure [9]. SCFGs have been used for the single-sequence prediction problem [10], but their advantage comes through coupling with a molecular evolution model. The first comparative SCFG-based approach was developed in Pfold [11, 12], where alignment column probabilities were determined through single and paired column evolution models, calculated via the Felsenstein pruning algorithm [13]. For a more complete review on RNA secondary structure prediction, see [14].

Consequently, it is important to understand more about the variability of RNA secondary structure prediction programs. In comparative prediction, there are many sources of variability: alignment quality, the number of sequences chosen and which alignment samples have been taken, the evolutionary relationships between sequences, as well as the ill-conditioned nature of the folding model itself. Understanding and quantifying these variances is key for biological applications that rely on these folding programs. Additionally, other bioinformatics software that utilize these folding programs- for example inverse RNA folding algorithms [17]- may often experience a fundamental limitation in performance due to variance in structure prediction quality.

Sequence alignment is a fundamental step in most comparative sequence analysis pipelines. The typical approach is to create a single, trusted multiple alignment of the sequences using methods based on an artificial scoring scheme and heuristics to find a highly scoring alignment [18, 19]. Although this methodology is successful when the alignment is well resolved, it has been shown in the context of downstream analyses that the end result can be highly sensitive to the choice of alignment [20-22]. RNA secondary structure prediction methods take a variety of approaches with respect to possible errors in RNA alignments. Some methods (e.g. [23]) invoke a fold-and-align approach directly, where alignment is done simultaneously with structure prediction. Pfold, instead, takes a fixed alignment as input, but allocates a nite probability to a nucleotide being any other nucleotide; this makes the model more robust to poor alignment quality. Most modern methods (e.g. [24]) still consider prediction from a single, fixed, alignment. Recently, alignment-free methods have also been proposed [25]. However, even after considering poor alignment quality, there are many additional variances associated with poor comparative RNA secondary structure prediction.

Sequence selection is another important variable. Alignment methods may produce poor alignments due to poor individual sequences, which will in turn produce poor structure predictions. There have been methods developed to select homologous sequences, particularly [26], which is based on evolutionary models and structural constraints. This is implemented in [27], which shows strong results in RNA secondary structure prediction can be found by selecting useful sequences.

In this paper we consider the problem of variances in comparative RNA secondary structure prediction. We present statistical analyses of different variances including the relationship between structure prediction quality and chosen alignment distance from the reference alignment, and a predictive algorithm for accuracy is provided. Factors like the number of sequences in the alignment and the evolutionary distance of the sequences are considered. Finally, a novel method is presented which extends information entropy for stochastic context-free grammars [28] to consider variation over alignments.

### Statistical alignment

Statistical alignment [29] provides a solution to many of the issues encountered with the traditional approach of sequence alignment. It models sequence evolution as a stochastic process involving sequence insertions, deletions and character substitutions, which defines a probability distribution over the alignments of the sequences. Using techniques such as Expectation Maximisation or Markov Chain Monte Carlo (MCMC), it is possible to either maximize the likelihood of the alignment, or generate a representative set of putative alignments by sampling from the alignment distribution.

Several statistical alignment implementations have emerged in past years [30-33], some of which allow co-sampling other entities such as the evolutionary tree or the locations of cis-regulatory motifs. Such methods can highlight homoplasy and alignment uncertainty with high accuracy or can be used to decrease alignment uncertainty effects in downstream analyses, for instance protein secondary structure prediction [34].

StatAlign is a statistical alignment software package [32] that allows joint Bayesian analysis of multiple alignments, phylogenetic trees and evolutionary parameters. The background model for insertions and deletions is a modified version of the TKF92 model [35] as described in [34]. The indel model can be coupled with an arbitrary substitution model (many of which are distributed with the software, both for protein and nucleotide sequence data). The Bayesian analysis is based on MCMC, the transition kernels are improved versions of those in [34]. StatAlign generates random samples from the joint posterior distribution of sequence alignments, evolutionary trees and model parameters. This high-dimensional joint distribution can be analysed in several ways, ranging from the simple statistics of marginalised single dimensions (e.g. the posterior distribution of a single rate parameter) to the application of other tools to the alignment samples.

### Alignment and RNA secondary structure accuracy metrics

To analyse variances of RNA secondary structure as alignment quality varies, we calculate a similarity score that measures how close a sample alignment is to the reference alignment. We use an alignment metric, taken from [36], which is generalised to an alignment method in [37].

*s*

_{1}of length

*n*to a sequence

*s*

_{2}of length

*m*. Each column of ${a}_{{s}_{1}}{,}_{{s}_{2}}$ can be expressed as pairs of the form $\left({s}_{1}^{i},{s}_{2}^{j}\right)$, $\left({s}_{1}^{i},-\right)$, and $\left(-,{s}_{2}^{i}\right)$. We define

_{1}and s

_{2}to be

For example, consider the case ${a}_{{s}_{1},{s}_{2}}^{k}={a}_{{s}_{1},{s}_{2}}^{\mathit{\ell}}.$ Then we have $n=m,\left|H\left({a}_{{s}_{1},{s}_{2}}^{k}\right)\cap H\left({a}_{{s}_{1},{s}_{2}}^{\mathit{\ell}}\right)\right|=n,\left|D\left({a}_{{s}_{1},{s}_{2}}^{k}\right)\cap D\left({a}_{{s}_{1},{s}_{2}}^{\mathit{\ell}}\right)\right|=0$ as there are no deletions, and $\left|I\left({a}_{{s}_{1},{s}_{2}}^{k}\right)\cap I\left({a}_{{s}_{1},{s}_{2}}^{\mathit{\ell}}\right)\right|=0$ as there are no insertions. This gives the distance between the alignments as zero, as would be expected.

*a*

^{ k }and ${a}^{\ell}$ are alignments of

*m*sequences

*s*

_{1}, …,

*s*

_{ m }of lengths

*n*

_{1}, …,

*n*

_{ m }, we have

The denominator of the fraction, $\left(m-1\right){\displaystyle \sum _{t=1}^{m}{n}_{t}}$, is the normalizing constant, the maximum that the alignment distance $d\left({a}^{k},{a}^{\ell}\right)$ can be. The similarity score is bounded by 0 and 1, with 1 indicating that the sample alignment is identical to the reference alignment.

#### RNA secondary structure metrics

The strength of these RNA secondary structure accuracy metrics is that they are easy to interpret, and make it straightforward to compare methods across different datasets. An F-score of 1 would represent a structure prediction that was completely correct and an F-score of 0 a structure prediction that only predicted incorrect base-pairs.

### Information entropy

*H*of a probability distribution P with a set of events $\mathcal{X}$ is defined as:

Information entropy is a measure for the “spread” of the probability distribution, and has well-defined lower and upper bounds. The minimum entropy of 0 occurs when there is only one outcome with probability 1, and the maximum entropy of log_{2} (*n*) occurs when there are *n* possible outcomes, each with probability 1/*n*, that is the uniform distribution. When the base of the logarithm is 2, the entropy is measured in bits. For a probability distribution, an entropy of *k* bits indicates that the expected value of the information content of observing a single outcome is *k* bits. In the context of secondary structure prediction, a low entropy therefore indicates that few secondary structures dominate the probability space, whereas a high entropy indicates a more even probability distribution over possible secondary structures. Thus, information entropy is a useful single quantity to characterize the underlying probability distribution of secondary structures.

The information entropy of the probability distribution over the possible secondary structures generated by a phylo-SCFG can be obtained using expected rule frequencies, which can be computed using the inside-outside algorithm [28]. This is outlined here.

*d*can be written as the product of the SCFG production rule probabilities and the phylogenetic probabilities, we can write the total probability

*T*of the grammar producing the input string as

_{ G }[

*d*] denotes the prior probabilities obtained from the SCFG part of the model, and P

_{ T }[

*d*] are the likelihood factors obtained from the phylogenetic model. Conditioning on producing the input string, the normalized probability of a derivation

*d*is ${P}_{\Phi}\left[d\right]=\frac{1}{T}P\left[d\right]=\frac{1}{T}{P}_{G}\left[d\right]{P}_{T}\left[d\right]$. Consequently, we have that the information entropy of the input alignment under the phylo-SCFG model is

*R*, we can write the SCFG contribution in terms of the expected production rule frequency,

*r*

_{ a }∈

*R*be a SCFG rule which produces base pairs, and

*r*

_{ b }∈

*R*a SCFG rule which produces unpaired bases. Define 1

^{ d }(

*i*,

*j*), the indicator function for whether the column pair (

*i*,

*j*) is emitted from a rule

*r*

_{ a }(i.e. position

*i*and

*j*form a pair), and 1

^{ s }(

*i*), the indicator function for whether column

*i*is emitted from a rule

*r*

_{ b }(i.e. position

*i*is unpaired). Finally, define

*f*

_{ d }(

*r*

_{ a }) as the frequency that rule

*r*

_{ a }is used in derivation

*d*. Then

As $\sum _{d\in \Phi}{1}^{d}\left(i,j\right)P\left[d\right]$ is the total probability under the model that positions *i* and *j* are emitted from a rule *r*_{
a
}, and $\sum _{d\in \Phi}{1}^{s}\left(i\right)P\left[d\right]$ is just the total probability under the model that position *i* is emitted from a rule *r*_{
b
}, the quantity $\sum _{d\in \Phi}P\left[d\right]{log}_{2}\left({P}_{T}\left[d\right]\right)$ can be computed using the expected rule frequencies obtained through the inside-outside algorithm [39].

## Methods

### Data

#### StatAlign Dataset

To test factors relating to alignment quality and secondary structure prediction quality, a large number of alignment samples from trusted reference alignments with known secondary structures are needed. We have created a curated RNA dataset based on the Rfam database [40] for the purposes of evaluating the framework. Alignments of homologous RNA sequences with known consensus secondary structure were extracted from Rfam seed alignments. From these, 50 RNA families with at least 50 sequences were randomly selected (see Additional file 1) in the section “StatAlign Dataset”. From each family, in a pre-filtering step we removed divergent sequences with long indels, as follows. We defined insertion as consecutive non-gap characters of a sequence in the reference alignment which appear in columns where over 80% of the sequences have gaps. Deletions were defined analogously. Columns with fraction of gaps between 20% and 80% were regarded ambiguous and ignored. To over-penalize long indels, we applied the super linear score function *l* × log_{2} (*l* + 1) for indels of length *l*, indels being defined as above. Then, a sequence was removed from a family if its total indel score was beyond 20 *and* the difference between its indel score and the mean indel score in the family was beyond 3.7 times the standard deviation of the indel scores in family, i.e. if the sequence had significantly more and/or longer indels than what is representative of the rest of the family. Then, 50 sequences were selected at random, and further random selection was done to get pairs, triplets etc. up to 15 sequences in alignments. From these samples of known reference alignment, we could produce many different alignment samples using StatAlign [32]. For each RNA alignment, 200 alignment samples were taken, and the reference alignments were also kept to for comparison. We refer to this dataset throughout as the *StatAlign dataset*.

#### Random alignment data

where *a*^{
r
} is the reference alignment, d is the target distance to get samples from and *t* is the temperature of the chain. To improve the mixing properties of the chains we allowed each chain to explore alignments that do not exactly match the specified target distance (with an exponentially decreasing probability, as described by Eq. 9) but then rejected non-exact matches when taking samples from the cold chain (*t* = 1).

- 1.breaking an alignment column into two columns by moving one of its characters into a new column, placing gaps in every other row:$\begin{array}{c}\hfill \begin{array}{ccc}\hfill \begin{array}{l}C\\ C\\ C\end{array}\hfill & \hfill \left|\begin{array}{l}A\\ A\\ A\end{array}\right|\hfill & \hfill \begin{array}{l}G\\ G\\ G\end{array}\hfill \end{array}\hfill \\ \hfill \underset{2.}{\overset{1.}{\rightleftharpoons}}\hfill \\ \hfill \begin{array}{ccc}\hfill \begin{array}{l}C\\ C\\ C\end{array}\hfill & \hfill \left|\begin{array}{l}-A\\ A-\\ A-\end{array}\right|\hfill & \hfill \begin{array}{l}G\\ G\\ G\end{array}\hfill \end{array}\hfill \end{array}$

- 2.
the exact reverse of the previous, i.e. joining two compatible adjacent columns - one having gaps in all but one row, the other having a gap in that row - to form a single column (see above)

- 3.relocating one character of a column to a gap position of an adjoining column:$\begin{array}{c}\hfill \begin{array}{ccc}\hfill \begin{array}{l}C\\ C\\ C\end{array}\hfill & \hfill \left|\begin{array}{l}\mathit{AG}\\ \mathit{AG}\\ A-\end{array}\right|\hfill & \hfill \begin{array}{l}T\\ T\\ T\end{array}\hfill \end{array}\hfill \\ \hfill \stackrel{3.}{\iff}\hfill \\ \hfill \begin{array}{ccc}\hfill \begin{array}{l}C\\ C\\ C\end{array}\hfill & \hfill \left|\begin{array}{l}\mathit{AG}\\ \mathit{AG}\\ -A\end{array}\right|\hfill & \hfill \begin{array}{l}T\\ T\\ T\end{array}\hfill \end{array}\hfill \end{array}$

As the space of alignments is vast and the moves are very local, to get a good approximation of uniform sampling, the number of steps that have to be done is on the order of millions, even for small inputs when a single chain is run. To this end, we created a special alignment representation where the above rearrangements can be carried out very efficiently (essentially constant time, i.e. in time proportional to column size but independent of the alignment length, even when the cost of randomly selecting the column to alter is included). For moderate input sizes (5-10 sequences of length 300-500) this representation allowed us to take 1.5 million rearrangement steps in a second (using a Java 6 implementation on one core of a 2.4 GHz Intel i3 processor).

A single chain is sufficient for small alignments, but very slow mixing becomes an issue for practical alignment sizes. We have found that standard parallel tempering techniques effectively speed up mixing if the chain temperatures are chosen correctly. Because the optimal temperatures vary significantly depending on reference alignment characteristics and target distribution, a simple acceptance optimization routine was added that tunes chain temperatures to achieve chain swap acceptance ratios between neighboring chains around a pre-set value. We found ratios 0.7-0.8 to be the most effective. With this framework, running 8-10 parallel chains was best to maximize the speed of convergence to the uniform distribution as compared to a single chain with sampling times 8-10-fold.

The above described framework was utilized to create a dataset consisting of the RNA families of StatAlign dataset, where for each family and each selection of 5 representative sequences from the family, 10 samples were taken at a distance corresponding to a similarity of 0.98 to the reference alignment (see Eq. 3), then 10 samples at a similarity of 0.96 etc., down to a similarity ratio of 0.6. We refer to this dataset as the *Random Alignment dataset*.

### Extending information entropy to alignment space

we would not want to suggest that these alignments give two different secondary structures. Consequently, we use a projection method to give alignment column pairing probability matrices the same dimension, so that the matrices can be averaged.

For a given set of input sequences, the sequence containing the greatest number of non-gap characters was chosen as the reference sequence. Each pairing probability matrix is projected by deleting columns and rows of the matrix corresponding to gap positions in the reference sequence, thus ensuring each matrix corresponding to an alignment sample has the same dimensions (a square matrix, with dimensions equal to the number of non-gap characters in the reference sequence). For example, we might start with an (*n* + 1) (*n* + 1) matrix, delete row *i* and column *i* due to a gap in position *i* in the reference sequence, then end up with an n × n matrix as required.

*ss*is

and there is no known efficient way to recurse over all possible alignments. Instead, we create an information entropy measure based on samples from the alignment space, and show that, in the sample-size limit, the alignment-sample information entropy tends to the true information entropy.

*a*

_{1}, …,

*a*

_{ n }from the space of all alignments of

*m*sequences, sampled according to their probability. If we are using statistical alignment, as in StatAlign, we will be sampling alignments in this fashion. Then we have for a column c, once alignments have been projected to the same length, the probability of being unpaired

_{ S }as the average of the sample phylogenetic probabilities:

*n*→

*∞*, by the weak law of large numbers

as required.

*H*

_{Φ}(P) of grammar derivations Φ of a grammar

*G*is

Since tree probabilities are the product of unpaired and paired column probabilities in the derivation, the tree probabilities can be recalculated from the sample of alignments. These can then be substituted into the above equation to get an approximation to the information entropy over the space of sampled alignments as well. We refer to this entropy of more than one alignment as the *alignment consensus entropy*.

## Results and discussion

### Alignment quality and predictive accuracy

A common question in comparative RNA secondary structure prediction is how many sequences are required to get a good structure prediction. This is briefly addressed in [11], but the sample size is quite small, and only total accuracy is considered. With 15 sequences in the alignment, we assume that no more evolutionary information can be gained by adding further sequences, but when fewer sequences are present, lack of information might yield a poorer structure prediction.

Instead of considering total accuracy, we wanted to quantify relatively how much accuracy is lost when fewer sequences are present. For example, if an alignment with 15 sequences is predicted with an average F-score of 0.5, and an alignment of the same family with 3 sequences is predicted with an average F-score of 0.4, then 80% of the accuracy will have been retained, that is the alignment with 3 sequences has a *relative F-score* of 0.8.

To investigate how many sequences are needed for a good structure prediction, we took the StatAlign dataset and considered the relative F-score for each family. Almost 100% of the possible F-score was achieved when the alignment contained 5 sequences, for both PPfold and RNAalifold. Interestingly, the accuracy of RNAalifold decreased slightly as the number of sequences was increased. This is due to the increased number of non-canonical base-pairs in the alignments, which the thermodynamic method could not predict. PPfold, on the other hand, has a small probability for non-canonical base-pairs, so is not affected by these in the same way. Overall these results suggest that 5 sequences are sufficient for approaching maximal predictive accuracy.

Firstly, in the StatAlign dataset (Figure 2A), we observe a weaker correlation between alignment distance and accuracy than for the Random Alignment dataset (Figure 2B). This suggests that predictions are better for the StatAlign dataset on alignments far from the reference alignment. StatAlign looks to produce alignments with a high likelihood under an evolutionary model, which the random alignments do not consider, and so StatAlign's alignments could be considered more realistic. This con rms the expectation that StatAlign's alignments are more useful for RNA secondary structure prediction than random alignments.

Secondly, for the StatAlign dataset, we see a much higher correlation with the F-score than with the relative F-score. Some families of RNA consistently produce the same alignment, which skews the graphs. For example, an alignment which consistently has similarity to the reference alignment of 0.9, and F-score 0.5 would give a relative F-score of 1 every time, and would support the correlations seen on each graph. Because we can control the spread of distances in the random alignment data set, we don't see this behaviour. As expected, variation comes more with families that produce more variable alignments. In the StatAlign dataset, this is obscured by those families which produce consistent alignments.

Lastly, for the Random Alignment dataset (Figure 2B), we see many more zero quality predictions in the case of RNAalifold than in the case of PPfold. This is most easily seen by the larger intensity of red in the main body of the heatmap for PPfold. This suggests that PPfold functions better than RNAalifold when given a low-quality alignment, likely due to its more complete model for molecular evolution.

#### Evolutionary distance

We also consider the effects of evolutionary distance on RNA secondary structure prediction quality. One might expect that there is a “sweet spot” for evolutionary distance- sequences too close to each other do not display enough co-variation to benefit the evolutionary model, but the evolutionary signal might be lost if the distance is too large. To investigate this, we measured evolutionary distance in the phylogenetic trees predicted by PPfold using four different measures:

**Measure 1-** Mean of all the evolutionary distances,

**Measure 2-** Standard deviation of all the evolutionary distances,

**Measure 3-** Maximum evolutionary distance,

**Measure 4-** Maximum difference between evolutionary distances.

All four measures would be expected to be correlated with sequence length, which is well known to correlate with predictive accuracy. To account for this, we considered relative evolutionary distance, similar to the relative predictive accuracy above. A measure was normalized by the average for that family and number of sequences, so that it could be seen whether an alignment had greater or less evolutionary distance than might be expected. We then looked at the correlation between relative evolutionary distance and predictive accuracy. All methods correlated extremely poorly with predictive accuracy and relative predictive accuracy, measures 1 to 4 having *R*^{2} correlations with relative predictive accuracy of 0.0259, 0.0373, 0.0303, and 0.0265 respectively (data not shown). This suggests that evolutionary distance is not an underlying factor for variation in RNA secondary structure prediction, and those other factors, such as those seen in [26], play a more important role in determining predictive accuracy.

### Alignment distances and maximum posterior decoding

Given the results concerning accuracy lost as alignment quality decreases, it would be desirable to be able to predict alignment quality, with the hope of predicting structure prediction quality. This has previously been attempted in [36]. First, the sequences were aligned with ClustalW [18]. The sequences were then re-aligned using 4 other programs (Align-m [41], MUSCLE [42], Prob-Cons [43], and T-Coffee [44]) and the similarity between the alignment generated using ClustalW to each of the 4 other alignments was measured. The maximum of the 4 similarities, max (*g*), was chosen as a predictor of alignment quality. The authors of [36] detected a strong correlation between the true similarity (the similarity between the ClustalW alignment and a reference alignment) and max (*g*).

We implemented a modified version of this method. For a given set of input sequences we aligned with both AMAP [36] and with StatAlign, obtaining the maximum posterior decoding alignment (MPD alignment) from StatAlign. The similarity between the AMAP alignment and the MPD alignment was used as our predicted similarity measure. This produced a strong correlation between our predicted similarity and true similarity, with an adjusted R-squared value of 0.6524.

### Extending information entropy to alignment space

### Predicting secondary structure accuracy

Given strong correlations between the alignment quality and the structure prediction quality, we might expect that we could find a predictor of structure prediction accuracy. By “integrating out” alignment uncertainty, we may find a reliability score which is more reliable than the one currently reported by the PPfold. To test this, we predicted accuracy for one of the five-sequence alignments of each family and then tested the predicted accuracy against the true accuracy. The PPfold reliability score produced an adjusted *R*^{2} score of 0.252 when considering correlation with the true F-scores.

*R*

^{2}value with the new reliability measure improved to 0.496. These results seem to indicate that while alignment quality does affect structure prediction quality, the actual structure prediction model still plays a great role in the overall prediction accuracy. Consequently, improving these models, possibly by incorporating other kinds of information (such as experimental probing data), is an area where new research e orts are still needed in RNA secondary structure prediction.

## Conclusions

In this paper we have explored a number of factors which can contribute to poor RNA secondary structure prediction quality. We established a relationship between alignment quality and expected loss of accuracy. Furthermore, we provided a method to predict alignment quality based only on statistical alignment samples. While our predictor of accuracy improves on the PPfold reliability score, there is still a large amount of variability unaccounted for, which we therefore suggest comes from the predictive model itself. To consider this further, we extended the information entropy measure for SCFGs to consider uncertainty in alignments.

The fact that our accuracy predictor did not account for all the variances associated with RNA secondary structure prediction, despite good predictors being found for alignment quality and a strong correlation between alignment quality and predictive accuracy, suggests that whilst alignment quality is an important factor, the predictive model itself determines plays a large part in the quality of prediction. Given what is shown in Figure 1 for single sequence predictions, that the accuracy of PPfold and RNAfold is very variable, it is unsurprising that variances remain. Clearly then, further efforts should be put into creating stronger single-sequence models, and then the advantages of evolutionary modelling and additional structural constraints will benefit further. The use of experimental data from new probing experiments as well as more biologically realistic constraints, such as kinetic or co-transcriptional folding, may improve the results of RNA secondary structure prediction.

## Declarations

### Acknowledgements

In part, this work was carried out as part of the Oxford Summer School in Computational Biology, 2012, in conjunction with the Department of Plant Sciences and the Department of Zoology. Funding was provided by the EU COGANGS Grant. ÁN would like to thank BBSRC for continued funding through OCISB. JWJA would like to thank the EPSRC for funding.

## Authors’ Affiliations

## References

- Chu D, Barnes DJ, von der Haar T: The role of tRNA and ribosome competition in coupling the expression of different mRNAs in Saccharomyces cerevisiae. Nucleic Acids Res. 2011, 39 (15): 6705-6714. 10.1093/nar/gkr300.PubMed CentralView ArticlePubMedGoogle Scholar
- Gahura O, Hammann C, Valentova A, Puta F, Folk P: Secondary structure is required for 3' splice site recognition in yeast. Nucleic Acids Res. 2011, 39 (22): 9759-9767. 10.1093/nar/gkr662.PubMed CentralView ArticlePubMedGoogle Scholar
- Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, Chang HY, Segal E: Genome-wide measurement of RNA secondary structure in yeast. Nature. 2010, 467 (7311): 103-107. 10.1038/nature09322.View ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL, Lukasser M, Huttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotech. 2005, 23 (11): 1383-1390. 10.1038/nbt1144.View ArticleGoogle Scholar
- Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Chem Monthly. 1994, 125 (2): 167-188. 10.1007/BF00818163.View ArticleGoogle Scholar
- Markham NR, Zuker M, Keith JM, Walker JM: Bioinformatics. UNAFold. 2008, Totowa, NJ: Humana Press, 3-31. [Methods in Molecular Biology; SP: 3]Google Scholar
- Bernhart S, Hofacker I, Will S, Gruber A, Stadler P: RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 2008, 9: 474-10.1186/1471-2105-9-474.PubMed CentralView ArticlePubMedGoogle Scholar
- Seemann SE, Gorodkin J, Backofen R: Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res. 2008, 36 (20): 6355-6362. 10.1093/nar/gkn544. http://nar.oxfordjournals.org/content/36/20/6355.abstract,PubMed CentralView ArticlePubMedGoogle Scholar
- Anderson JWJ, Tataru P, Staines J, Hein J, Lyngso R: Evolving stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2012, 13: 78-10.1186/1471-2105-13-78.View ArticleGoogle Scholar
- Dowell R, Eddy S: Evaluation of several lightweight stochastic context-free grammars for RNA sec-ondary structure prediction. BMC Bioinformatics. 2004, 5: 71-10.1186/1471-2105-5-71.PubMed CentralView ArticlePubMedGoogle Scholar
- Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999, 15 (6): 446-454. 10.1093/bioinformatics/15.6.446.View ArticlePubMedGoogle Scholar
- Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free gram-mars. Nucleic Acids Res. 2003, 31 (13): 3423-3428. 10.1093/nar/gkg614.PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981, 17 (6): 368-376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Gardner P, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004, 5: 140-10.1186/1471-2105-5-140.PubMed CentralView ArticlePubMedGoogle Scholar
- Sukosd Z, Knudsen B, Kjems J, Pedersen CNS: PPfold 3.0: fast RNA secondary structure prediction using phylogeny and auxiliary data. Bioinformatics. 2012, 28 (20): 2691-2692. 10.1093/bioinformatics/bts488.View ArticlePubMedGoogle Scholar
- Andronescu M, Bereg V, Hoos H, Condon A: RNA STRAND: The RNA Secondary Structure and Statistical Analysis Database. BMC Bioinformatics. 2008, 9: 340-10.1186/1471-2105-9-340.PubMed CentralView ArticlePubMedGoogle Scholar
- Lyngsoe R, Anderson J, Sizikova E, Badugu A, Hyland T, Hein J: Frnakenstein: multiple target inverse RNA folding. BMC Bioinformatics. 2012, 13: 260-10.1186/1471-2105-13-260.View ArticleGoogle Scholar
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence align-ment with the Clustal series of programs. Nucleic Acids Res. 2003, 31 (13): 3497-3500. 10.1093/nar/gkg500.PubMed CentralView ArticlePubMedGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33 (2): 511-518. 10.1093/nar/gki198.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang LS, Leebens-Mack J, Wall PK, Beckmann K, de Pamphilis CW, Warnow T: The Impact of Multiple Protein Sequence Alignment on Phylogenetic Estimation. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8 (4): 1108-1119.View ArticlePubMedGoogle Scholar
- Chivian D, Baker D: Homology modeling using parametric alignment ensemble generation with con-sensus and energy-based model selection. Nucleic Acids Res. 2006, 34 (17): e112-e112. 10.1093/nar/gkl480.PubMed CentralView ArticlePubMedGoogle Scholar
- Gardner PP, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005, 33 (8): 2433-2439. 10.1093/nar/gki541.PubMed CentralView ArticlePubMedGoogle Scholar
- Bradley RK, Pachter L, Holmes I: Speci c alignment of structured RNA: stochastic grammars and sequence annealing. Bioinformatics. 2008, 24 (23): 2677-2683. 10.1093/bioinformatics/btn495.PubMed CentralView ArticlePubMedGoogle Scholar
- Doose G, Metzler D: Bayesian sampling of evolutionarily conserved RNA secondary structures with pseudoknots. Bioinformatics. 2012, 28 (17): 2242-2248. 10.1093/bioinformatics/bts369.View ArticlePubMedGoogle Scholar
- Harmanci A, Sharma G, Mathews D: TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinformatics. 2011, 12: 108-10.1186/1471-2105-12-108.PubMed CentralView ArticlePubMedGoogle Scholar
- Engelen S, Tahi F: Predicting RNA secondary structure by the comparative approach: how to select the homologous sequences. BMC Bioinformatics. 2007, 8: 464-10.1186/1471-2105-8-464.PubMed CentralView ArticlePubMedGoogle Scholar
- Engelen S, Tahi F: Tfold: efficient in silico prediction of non-coding RNA secondary structures. Nucleic Acids Res. 2010, 38 (7): 2453-2466. 10.1093/nar/gkp1067.PubMed CentralView ArticlePubMedGoogle Scholar
- Sukosd Z, Knudsen B, Anderson JWJ, Novak A, Kjems J, Pedersen C: Characterising RNA secondary structure space using information entropy. BMC Bioinformatics. 2013, 14: S22-10.1186/1471-2105-14-22.PubMed CentralPubMedGoogle Scholar
- Hein J, Wiuf C, Knudsen B, Moller MB, Wibling G: Statistical alignment: computational properties, homology testing and goodness-of- fit. J Mol Biol. 2000, 302: 265-279. 10.1006/jmbi.2000.4061.View ArticlePubMedGoogle Scholar
- Lunter G, Miklos I, Drummond A, Jensen J, Hein J: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005, 6: 83-10.1186/1471-2105-6-83.PubMed CentralView ArticlePubMedGoogle Scholar
- Suchard MA, Redelings BD: BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006, 22 (16): 2047-2048. 10.1093/bioinformatics/btl175.View ArticlePubMedGoogle Scholar
- Novak A, Miklos I, Lyngso R, Hein J: StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics. 2008, 24 (20): 2403-2404. 10.1093/bioinformatics/btn457.View ArticlePubMedGoogle Scholar
- Satija R, Novak A, Miklos I, Lyngsø R, Hein J: BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol Biol. 2009, 9: 217-10.1186/1471-2148-9-217.PubMed CentralView ArticlePubMedGoogle Scholar
- Miklos I, Novak A, Dombai B, Hein J: How reliably can we predict the reliability of protein structure predictions?. BMC Bioinformatics. 2008, 9: 137-10.1186/1471-2105-9-137.PubMed CentralView ArticlePubMedGoogle Scholar
- Thorne JL, Kishino H, Felsenstein J: Inching toward reality: An improved likelihood model of sequence evolution. J Mol Evol. 1992, 34: 3-16. 10.1007/BF00163848.View ArticlePubMedGoogle Scholar
- Schwartz AS, Myers EW, Pachter L: Alignment Metric Accuracy. http://arxiv.org/abs/q-bio/0510052,
- Schwartz AS, Pachter L: Multiple alignment by sequence annealing. Bioinformatics. 2007, 23 (2): e24-e29. 10.1093/bioinformatics/btl311.View ArticlePubMedGoogle Scholar
- Freyhult E, Gardner P, Moulton V: A comparison of RNA folding measures. BMC Bioinformatics. 2005, 6: 241-10.1186/1471-2105-6-241.PubMed CentralView ArticlePubMedGoogle Scholar
- Lari K, Young SJ: The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Comput Speech Language. 1990, 4: 35-56. 10.1016/0885-2308(90)90022-X.View ArticleGoogle Scholar
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33 (suppl 1): D121-D124.PubMed CentralPubMedGoogle Scholar
- Walle IV, Lasters I, Wyns L: Align-m- a new algorithm for multiple alignments of highly divergent sequences. Bioinformatics. 2004, 20 (9): 1428-1435. 10.1093/bioinformatics/bth116.View ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.PubMed CentralView ArticlePubMedGoogle Scholar
- Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based mul-tiple sequence alignment. Genome Res. 2005, 15 (2): 330-340. 10.1101/gr.2821705.PubMed CentralView ArticlePubMedGoogle Scholar
- Cedric Notredame DGH, Heringa J: T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.