Dinucleotide controlled null models for comparative RNA gene prediction
- Tanja Gesell^{1, 2, 3, 4} and
- Stefan Washietl^{5, 6}Email author
DOI: 10.1186/1471-2105-9-248
© Gesell and Washietl; licensee BioMed Central Ltd. 2008
Received: 21 January 2008
Accepted: 27 May 2008
Published: 27 May 2008
Abstract
Background
Comparative prediction of RNA structures can be used to identify functional noncoding RNAs in genomic screens. It was shown recently by Babak et al. [BMC Bioinformatics. 8:33] that RNA gene prediction programs can be biased by the genomic dinucleotide content, in particular those programs using a thermodynamic folding model including stacking energies. As a consequence, there is need for dinucleotide-preserving control strategies to assess the significance of such predictions. While there have been randomization algorithms for single sequences for many years, the problem has remained challenging for multiple alignments and there is currently no algorithm available.
Results
We present a program called SISSIz that simulates multiple alignments of a given average dinucleotide content. Meeting additional requirements of an accurate null model, the randomized alignments are on average of the same sequence diversity and preserve local conservation and gap patterns. We make use of a phylogenetic substitution model that includes overlapping dependencies and site-specific rates. Using fast heuristics and a distance based approach, a tree is estimated under this model which is used to guide the simulations. The new algorithm is tested on vertebrate genomic alignments and the effect on RNA structure predictions is studied. In addition, we directly combined the new null model with the RNAalifold consensus folding algorithm giving a new variant of a thermodynamic structure based RNA gene finding program that is not biased by the dinucleotide content.
Conclusion
SISSIz implements an efficient algorithm to randomize multiple alignments preserving dinucleotide content. It can be used to get more accurate estimates of false positive rates of existing programs, to produce negative controls for the training of machine learning based programs, or as standalone RNA gene finding program. Other applications in comparative genomics that require randomization of multiple alignments can be considered.
Availability
SISSIz is available as open source C code that can be compiled for every major platform and downloaded here: http://sourceforge.net/projects/sissiz.
Background
Comparative genome analysis is currently the most widely used strategy to detect and annotate noncoding RNAs (ncRNAs) [1, 2]. In the past few years a series of different algorithms have been developed that predict functional ncRNAs on the basis of conserved secondary structure [3–10]. Some of these methods have been used to predict novel ncRNAs on a genome wide scale [7, 11–14]. In combination with experimental verification (microarray, RT-PCR, Northern blot) these methods could successfully uncover many examples of novel ncRNAs [15–20]. However, in particular in large vertebrate genomes the signal-to-noise ratio of true predictions and false positives is thought to be relatively low [20]. In a recent paper, Babak and colleagues demonstrated that comparative ncRNA gene finders are strongly biased by the genomic dinucleotide content leading to an excess of false predictions [21]. Especially methods that are based on a thermodynamic folding model are sensitive to this effect: In the so-called nearest neighbour model, energies are not assigned to single base-pairs but rather to neighbouring base-pairs that stack on each other. As a consequence, the folding stability of genomic sequences does not only depend on the monunucleotide content but also the dinucleotide content.
To assess the significance of predicted structures, e.g. to estimate the false discovery rate in a genomic screen for ncRNAs, one should therefore compare the genomic predictions to the results obtained on randomized data with the same dinucleotide content. In the case of single sequences, there are well known and widely used algorithms to generate dinucleotide controlled random sequences either by shuffling or first order Markov chain simulation [22, 23]. However, there is currently no algorithm to randomize multiple sequence alignments preserving the dinucleotide content. Babak and colleagues [21] added the conservation of dinucleotides as an additional constraint to the commonly used (mononucleotide) shuffling algorithm shuffle-aln.pl [5] and applied it to pairwise alignments. Their approach corresponds to a heuristic used in reference 24, that is very inefficient as only a small subspace of the whole permutation space is covered. The heuristic exchanges only positions that have the same neighbours left and right. For the short sequence ACAGCCAA for example not a single permutation can be found that way. However, there are 11 such permutations according to the Altschul & Erikson algorithm [22]. But even a more efficient shuffling algorithm will soon run into difficulties on multiple alignments. Unless two neighbouring columns are 100% conserved, there are several different dinucleotide pairs in these columns. It is therefore impossible to exactly preserve the dinucleotide content as in the single sequence case.
In this paper, we address the problem in a different way. In analogy to a first order Markov model for single sequences, we simulate alignments of a given dinucleotide content. We present a substitution model that captures the neighbour dependencies and all other relevant alignment features. We describe a time efficient way to estimate a tree under this model that we use as a guide to simulate alignments of the desired properties. This new control strategy is tested on genomic alignments and the effect on thermodynamic RNA structure predictions is studied.
Results
Requirements for an accurate null model
An optimal null model preserves all the features of the original data with the exception of the signal under question that needs to be removed efficiently. In our case, the data are multiple alignments of homologous sequences and the signal of interest is an evolved RNA secondary structure. Correlations arising from base-pairing patterns need to be removed. Currently, alignments are usually randomized by shuffling the alignment columns (see ref. 5 for a discussion of this method). Although the shuffling approach has its limitations and considering dinucleotides seems difficult, it is an appealing approach because it is relatively simple, fast, and extremely conservative. Changing the order of the columns does not change the mutational patterns within the columns and thus the underlying phylogenetic tree is exactly preserved.
In this paper we attempt to simulate new alignments from scratch. Even the most sophisticated model cannot capture all evolutionary processes and therefore a simulation approach will inevitably change the original data more than shuffling does. So much care has to be taken to preserve all the relevant characteristics of the data. To qualitatively assess the most important parameters that need to be considered in our model, we performed a series of simulation experiments. Using a simple tree with four taxa we simulated alignments under the HKY evolutionary model [25]. We systematically varied model and tree parameters to study how they affect thermodynamic RNA consensus structure predictions in the alignment. We used RNAalifold [26] to predict consensus secondary structures which is the basis of the AlifoldZ [5] and RNAz [6] gene finders.
Another major parameter that needs to be controlled is the sequence diversity of the alignment. Variation of the branch lengths of the tree gives alignments with different sequence diversity which we usually measure as the mean pairwise sequence identity (MPI, also sometimes refered to as average pairwise sequence identity APSI). High diversity (i.e. low MPI) makes it difficult to predict a consensus structure if there is no selection pressure for it. On the other hand, almost perfectly conserved sequences fold readily in some random structure even if there is no natural RNA structure present. Therefore we observe a strong dependency on the MPI (Fig. 1C).
One well known characteristic of natural mutation processes are the different rates for transitions and transversions [27]. Interestingly, this also affects the consensus structure predictions. A model with equal transition/transversion rates (parameter κ = 1 in the HKY model) gives less stable predictions than a model with more realistic rates (e.g κ = 4, Fig. 1D). This parameter affects the type of column patterns observed in the simulated alignments which in turn affects how well they can form consensus base pairs.
Natural mutation processes are not homogeneous across all sites, in particular in functional genomic regions. It was observed previously that mutation patterns within an alignment can affect structure predictions [5]. For example, an alignment containing a 100% conserved block with low mutation rate that is flanked by highly divergent regions of high mutation rate can have different folding energies compared to an alignment with homogeneous rates but the same overall MPI (Fig. 1E). The same is true for patterns of insertions and deletions which was also already discussed in reference 5 and which we do not show here explicitly again.
We also tested the effect of different tree topologies, but did not find a significant influence of this parameter at least in our four taxa example.
Taken together, an accurate randomization procedure needs to generate alignments that preserve (i) mono- and dinucleotide content, (ii) mean pairwise sequence identity, (iii) transition/transversion rate ratio (iv) site-specific mutation rates, and (v) gap patterns.
In the next section we describe a model that is capable of simulating alignments under these constraints.
Algorithm
Model
Sequence evolution is usually described by a time-continuous Markov process [27, 28]. The most commonly used models assume that all sites of a sequence evolve independently from each other rendering it impossible to model dinucleotide dependencies between neighbouring pairs. Various evolutionary models have been proposed in the past years to overcome this limitation [29–36]. We make use of the recently introduced framework called SISSI (SImulating Site-Specific Interactions). SISSI allows to define site dependencies of arbitrary complexity in the form of a "neighbourhood system" that also may include overlapping dependencies [37]. Given the requirements of our specific problem, we extended and simplified several aspects of SISSI as necessary.
Thus, the instantaneous rate matrix Q_{ k }has the dimension |$\mathcal{A}$|^{3} × |$\mathcal{A}$|^{3} = 64 × 64. The stationary distribution of Q_{ k }determines the equilibrium dinucleotide content of our system (see the next section for how the required trinucleotide frequencies of Q_{ k }are calculated from the dinucleotide frequencies).
We impose the usual restriction, that only one substitution per unit time is admissible [38, 39]. Moreover, Q_{ k }only allows for substitutions at site k. The diagonal elements of our instantaneous rate matrix Q_{ k }are defined by the mathematical requirement that the sum of each row is zero.
where π_{ k }(y) is the stationary frequency of y and the Hamming distance H(s_{ k }, y) counts the number of differences between the sites of the triplets s_{ k }and y.
In principle, we can choose any rate for the parameter r(s_{ k }, y). However, based on the requirement that we want to use the counted dinucleotide content as the stationary distribution, we choose r(s_{ k }, y) so that the model becomes reversible. Any parameter of the commonly used independent nucleotide substitution models, like HKY [25] or the general time-reversible model GTR [40] can be chosen for r(s_{ k }, y). For our application, we use a transition/transversion rate ratio and set r(s_{ k }, y)= κ for transitions and r(s_{ k }, y) = 1 for transversions.
Without dependencies on the neighbours, Q_{ k }is of dimension 4 × 4 and the model reduces essentially to a HKY model with a specific rate for each site. We use this mononucleotide variant later in this paper for testing and comparison to the dinucleotide model.
Simulation
In the most general SISSI framework Q_{ k }needs to be updated for all k sites every time one nucleotide in x is substituted. However, in our special case we can use the same instantaneous rate matrix Q_{ k }for each site with special conditions for r(s_{ k }, y). As a consequence, we can fix q(x) and do not need to sum over each rate of the site, which improves the running time of the algorithm.
Parameter estimation
The idea of our randomization procedure is to estimate a tree under the model described in the previous section and simulate sequences along this tree. Ideally, all parameters are estimated simultaneously within a maximum likelihood framework. One problem is the high number of parameters since we want to estimate a specific rate for each site. A more fundamental issue is, however, that our model includes overlapping dependencies which breaks the independence assumption necessary for basic maximum likelihood estimation. Other possible techniques like Markov chain Monte Carlo in a Bayesian framework are not a viable alternative either. Speed is a critical issue as the algorithm is meant to be applied to data on a genome wide scale.
Facing these difficulties, we have developed heuristic approximations to estimate the parameters and use a distance based approach to estimate the tree. The method is fast and yet surprisingly accurate for our application.
Equilibrium frequencies
where π(αβγ) are the trinucleotide frequencies, π(αβ) and π(βγ) the counted dinucleotide frequencies and π(β) = Σ_{ α }π(αβ) = Σ_{ α }π(βα) with α, β, γ ∈ {A, C, G, U}.
Distances and tree construction
Using this function, all pairwise distances are calculated for the sequences in the original alignment. From this distance matrix a tree is constructed using the BIONJ algorithm [41]. BIONJ is a variant of the well known neighbour joining algorithm and currently one of the most accurate algorithms for distance based tree building.
Given that the distances and the tree are accurately estimated, we observe on average the same mean pairwise identity in the simulated alignment as in the original one. Fig. 3C shows the distribution of MPIs of 1000 simulations of our example rRNA alignment. The average MPI of the simulations is exactly the same as the MPI 0.73 of the original alignment.
Site-specific rates
Setting different mutation rates at different sites gives us the possibility to preserve natural mutation patterns of the original alignment. The problem of finding accurate site-specific rates is illustrated in Fig. 3D. For each site in the alignment, the MPI of this site is plotted against the average MPI observed in the simulated alignments on the same site. If we consider equal rates for all sites, each site will have the same average MPI which is of course equal to the overall MPI of 0.73 of the whole alignment. Ideally, the average MPI for each simulated site is the same as the original MPI at this site. In this case, the points in the plot are on a diagonal indicating that we have found accurate estimates for the rates.
Simple estimates for site-specific rates in combination with distance based trees have been described previously [42]. The method includes fits to a gamma distribution which requires data of at least 1000 nucleotides and 30 sequences to get reasonable results. Here we use a different approach that also gives good results for smaller alignments.
It must be pointed out that the site-specific rates change the relationship between genetic distance and observed differences (Fig. 3B). For correcting the site-specific rates we use the estimates for $\widehat{a}$ and $\widehat{b}$ from our model without site-specific rates. So this is only an approximation and one could think about iteratively refining the estimates. However, we found that this approach already yields accurate rates within one step as can be seen in Fig. 3D. Using the model with site-specific rates, the simulated alignments have on average almost exactly the same site-wise MPI as the original one.
The reader will notice that the first three points deviate from the diagonal. This illustrates a limitation of our method. With our simulation procedure we can on average only reach the level of saturation even if we use very high rates. It is possible, however, that the original data contains sites below the level of saturation. For example in a four way alignment a column can be ACGT, i.e. MPI = 0. However, we cannot simulate on average columns with MPI = 0, since the MPI is bounded below by zero and our simulations will always contain columns with MPI > 0. In practice this does not seem to cause any obvious problems in particular when we have many sequences where it is unlikely to see columns below saturation.
Gaps
Gaps have been ignored completely so far. There are evolutionary models including deletions and insertions [43–47] and, in principle, it would be possible to combine the insertion-deletion dynamics with our model. However, this does not appear practical in our case. Existing algorithms for joint estimation of phylogenies and alignments are not only very time-consuming [47], it also seems difficult to estimate reasonable indel model parameters on relatively short alignment blocks which hold only little information. Moreover, alignment programs produce gap patterns that do not necessarily reflect phylogenetically reasonable insertion/deletion events and thus cannot always be captured by an idealized model that is motivated by evolutionary processes and ignores algorithmic idiosyncrasies of alignment programs.
So we follow here a very pragmatic strategy that has also been used previously [5]: We keep exactly the same gap pattern in our randomized alignments as in the original alignment. To this end, we simply treat gaps as missing data and simulate nucleotide characters for the gapped positions. This is done in a way that the overall characteristics are not changed when they are replaced with gaps again at the end (see Methods for details).
Transition/transversion rate ratio
The transition/transversion rate ratio κ is a parameter in our model that cannot be simply counted as in the case of the dinucleotide frequencies, or empirically determined like the branch lengths. Given that the influence of this parameter is not that critical as for example the branch length or base composition (see Fig. 1), one possibility might be to use a fixed transition/transversion ratio if a reasonable average value is known for the genome at hand. Alternatively, we found that a good estimate can be obtained by using maximum likelihood on an independent mononucleotide model. We used here the HKY model with -distributed rates which is closest to our dinucleotide model.
Putting it together
Implementation
We implemented our method in ANSI C in a program called SISSIz. The source code is available under the GNU Public Licence for download [48].
Some words on running time: One might suspect that the randomization algorithm including two times the sampling procedure to estimate the parameters of equation 9 and the maximum likelihood estimation of the transition/transversion rate ratio is relatively slow. Indeed, it is much slower than for example randomization by shuffling, but still very fast. To build the model for our example of 7 rRNAs of 158 length takes 0.2 seconds on a modern Intel Core 2 Quad CPU at 2.4 GHz. To simulate 1000 alignments using this model takes another 0.6 seconds.
Testing
Randomizing vertebrate genomic alignments
We tested our randomization method on vertebrate genomic alignments. In a setting similar to recent genomic screens in vertebrates [11, 20], we extracted Multiz [49] alignment blocks from human chromosome 1. We randomly selected 1000 alignment blocks between 70 and 120 nt in length and between 4 and 10 sequences without considering annotation information of any sort. These alignments are meant to represent an unbiased "genomic background" that may also contain functional elements like coding exons or structured RNAs depending on their frequency in the genome.
Influence of randomization procedure on RNA predictions
Clearly, the differences shown here in these cumulative histograms might appear very subtle. The results for the RNAz predictions, however, show that such differences can strongly affect the statistics of RNA gene predictions (Fig. 8B). On this particular test set, RNAz predicts RNA signals in 4.3% of the native alignments. Using the conventional shuffling strategy or mononucleotide based simulation one would estimate a false positive rate of 0.8% or 0.7%, respectively. Using the more conservative dinucleotide based model the estimate would be 2.1%, i.e. three times higher. This is consistent with the results obtained by Babak et al. using their dinucleotide shuffling approach on pairwise alignments.
Calculating z-scores to predict structural RNAs
We can directly assess the significance of a predicted RNA by calculating a z-score. The folding energy of the native data m and the mean μ and standard deviation σ of randomized data is calculated. The significance of the native fold can then be expressed as z = (m - μ)/σ, i.e. the number of standard deviations from the mean. This score has been repeatedly used on single sequences applying mono- or dinucleotide shuffling or simulation using a zero or first order Markov model [23, 24]. Using shuffled alignments as null model, this approach is implemented in the RNA gene finding program AlifoldZ [5]. The same strategy can be used in combination with our new dinucleotide base randomization strategy without any further modifications (Fig. 4).
To test the effectiveness of this approach, we conducted a benchmark similar to those used previously [5, 6] for testing AlifoldZ and RNAz. We used multiple sequence alignments of eight different structural RNA families taken from the Rfam database [50]. The alignments contained three to six sequences and had a mean pairwise identity between 50% and 100% (see Methods for details). For the tests of AlifoldZ and RNAz, shuffled alignments were used as negative controls. For obvious reasons, this is not possible here. So we used genomic alignments from random locations of the human genome (see Methods). Using the "genomic background" as negative controls in this test implies the assumption that the genome does not contain any structural RNAs at all, which is clearly not valid. However, if we assume true structural RNAs to be sparse in the genome this conservative assumption seems to be a sensible choice.
z-scores and classification performance
RNAz | AlifoldZ | SISSIz (mono) | SISSIz (di) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Data type | N | z | S_{0.01} | S_{0.05} | z | S_{0.01} | S_{0.05} | z | S_{0.01} | S_{0.05} | z | S_{0.01} | S_{0.05} |
5S rRNA | 368 | n/a | 0.77 | 0.98 | -6.72 | 0.84 | 0.98 | -6.35 | 0.86 | 0.98 | -6.35 | 0.93 | 1.00 |
tRNA | 382 | n/a | 0.74 | 0.98 | -6.29 | 0.75 | 0.98 | -6.24 | 0.74 | 0.98 | -5.86 | 0.88 | 0.99 |
U2 snRNA | 458 | n/a | 0.76 | 1.00 | -7.17 | 0.89 | 0.99 | -5.92 | 0.84 | 0.97 | -5.22 | 0.93 | 0.99 |
U3 snRNA | 377 | n/a | 0.52 | 0.92 | -5.11 | 0.74 | 0.86 | -4.47 | 0.69 | 0.83 | -4.23 | 0.76 | 0.86 |
U5 snRNA | 424 | n/a | 0.90 | 0.96 | -5.61 | 0.77 | 0.96 | -5.10 | 0.69 | 0.89 | -4.43 | 0.76 | 0.91 |
Hammerhead | 499 | n/a | 0.78 | 1.00 | -6.68 | 0.85 | 1.00 | -6.67 | 0.90 | 1.00 | -6.66 | 0.99 | 1.00 |
Group II intron | 480 | n/a | 0.68 | 0.82 | -6.58 | 0.74 | 0.81 | -6.77 | 0.72 | 0.81 | -6.29 | 0.77 | 0.82 |
micro RNA precursor | 571 | n/a | 0.75 | 1.00 | -8.89 | 1.00 | 1.00 | -8.84 | 1.00 | 1.00 | -7.58 | 1.00 | 1.00 |
Total of all classes | 3559 | n/a | 0.80 | 0.96 | -6.75 | 0.87 | 0.95 | -6.43 | 0.85 | 0.94 | -5.93 | 0.90 | 0.95 |
Genomic background | 3559 | n/a | n/a | n/a | -0.44 | n/a | n/a | -0.58 | n/a | n/a | -0.15 | n/a | n/a |
Using mononucleotide based randomization the z-scores of the genomic background are approximately half a standard deviation from zero (-0.44 and -0.58, for shuffling and mononucleotide simulation respectively). This shows the relatively strong "bias" of the genomic background that causes false positive predictions as shown in the previous section and in reference 21. Albeit the signal does not vanish completely, the dinucleotide based z-scores are much closer to zero (-0.15).
The z-scores of the structural RNAs in this test set are on average well below -4 indicating a clear structural signal. Also here, we observe that mononucleotide simulated z-scores are lower than the dinucleotide simulated ones. In this case, a dinucleotide content that favors stable RNA structures is clearly not only a general background effect of the genomic base composition but a feature of structural RNAs. However, this signal is lost if the more conservative dinucleotide based null model is used.
There is also a clear difference between the two monunucleotide randomization procedures: Shuffling leads to more significant z-scores than simulation. The main reason is the fact that simulation results in higher standard deviations than shuffling which in turn lead to more conservative z-scores.
The ROC curve shows that all the methods perform very well on this test set. The curve further suggests that there is not much difference between them. However, differences become evident when looking at the region of high specificity, the only relevant region for practical applications (see inset Fig. 9). Here, the dinucleotide based approach generally outperforms the mononucleotide based methods. The improvement is small but clearly noticeable: At a false positive rate of 0.01%, dinucleotide based simulation shows the highest sensitivity for 7 of the 8 RNA classes. For example, in the tRNA group the sensitivity is 13% higher than AlifoldZ and RNAz. The latter performs significantly worse than all other methods at this level. At a false positive rate of 0.05%, dinucleotide simulation still performs slightly better than mononucleotide shuffling/simulation but is on the same level as RNAz that performs significantly better here.
Discussion
Any experiment is only as good as its controls. What is true for experimental biology clearly also holds in the field of computational biology. The value of even the most sophisticated algorithm remains unclear if the significance of the results cannot be assessed properly. In this paper we addressed the problem of finding an adequate control strategy for comparative noncoding RNA predictions, which are started to get widely used for genome annotation.
Babak et al. demonstrated that currently used null models based on mononucleotide shuffling lead to an underestimation of the false positive rate in such screens. Although single opinions may be different [51], it is generally accepted that in the context of RNA gene prediction one should consider dinucleotide content as "background" rather than "signal". However, while there have been dinucleotide controlled randomization algorithms for single sequences for more than 20 years, it is a non-trivial problem in the case of multiple sequence alignments.
Here we devised a simulation procedure that produces alignments that have on average a given dinucleotide frequency and sequence diversity (globally and locally). The corresponding model needs to be relatively complex including overlapping dependencies and site-specific rates. Clearly, this model with a high number of parameters would not be a reasonable choice for use in phylogenetic analysis, but it turned out to be a good choice for this specific application.
We have to use heuristics and simplifications to estimate the tree and parameters for this model in reasonable time. The accuracy of our approach is measured in terms of how well the simulations reflect the properties of the original data. In this respect, we found that our strategy performs very well. Again, phylogenetic analysis was not the goal here, but some of the techniques introduced here might be of interest in this context. For example, we found that in the mononucleotide case our estimations for site-specific rates are surprisingly competitive when compared to the currently best maximum likelihood methods (data not shown).
The influence of the null model for genomic RNA predictions was found to be remarkable. Consistent with Babak and colleagues' findings on pairwise alignments, we observed three times more false positives using dinucleotide controls than using mononucleotide controls. This clearly shows that the new approach should be the method of choice to get more sensible estimates of the significance of comparative RNA predictions.
The next obvious step, is to use the new null model to improve current RNA gene prediction algorithms. In analogy to AlifoldZ, we combined our new simulation procedure with the RNAalifold consensus structure prediction algorithm. SISSIz calculates z-scores that are not biased by the genomic dinucleotide content and it is thus the first comparative gene finding program, that explicitely corrects for this effect. However, by using this conservative null model we also loose part of the signal in true structured RNAs. This might be the main reason, why the observed improvements in the overall classification performance were only relatively small.
In general, the support vector machine approach used by RNAz is preferable over the AlifoldZ approach, since it is orders of magnitude faster. However, it turned out to be difficult to create a dinucleotide based version of RNAz mainly for two reasons. Until now, there was no way to produce a dinucleotide controlled negative test set that is necessary for training the two class support vector machine [6]. With the method presented here, we have solved this problem and it is now possible to create test sets with specific dinucleotide properties. However, it remains an unsolved question how to compute dinucleotide based z-scores efficiently without shuffling. RNAz uses a regression approach to solve this problem for mononucleotides, which, unfortunately, does not scale well to the high dimensional dinucleotide case.
A promising alternative to the thermodynamic RNA prediction methods used in this paper, are probabilistic methods. The EvoFold algorithm [7] uses phylogenetic stochastic context-free grammars and, in its core, depends on a null model which is essentially an independent mononucleotide model. Since the folding grammar of EvoFold does not explicitely model stacking interactions there is no need for using a null model with overlapping dinucleotides as we have described here. However, also EvoFold was found to be affected to some degree by the dinucleotide content for reasons that are not immediately obvious [21]. A dinucleotide background model together with an advanced folding grammar that considers stacks can thus be expected to improve performance. However, it would require considerable effort to include such a null model into the sophisticated probabilistic framework of EvoFold.
Finally, we want to add that our randomization algorithm is not only of interest in the context of RNA gene prediction. It can be used for other comparative genomics applications whenever random alignments are needed as control. One could consider other applications in the context of RNA structures (e.g. prediction of conserved miRNA target sites) but also in different context (e.g. conserved sequence motifs). Currently our software implements a mono- and dinucleotide model which should be sufficient for many applications. In principle, however, it is also possible to consider higher order correlations within this framework.
Methods
Treating gaps
Gapped positions are treated as missing data. When counting the dinucleotide content, dinucleotides including a gap (N-, -N, --) are ignored. During simulation, gap positions are filled with nucleotides and gaps are re-introduced afterwards. Note that this way, if two nucleotides N_{1} and N_{2} are separated by a gap (e.g. N_{1}----N_{2}) the dinucleotide N_{1}N_{2} is not in equilibrium. Depending on how gaps are treated in the downstream analysis this might be or might not be of concern. In any case, since not every gap position but only every gap opened is affected, this (potential) error is generally very small for reasonable alignments. So we did not consider correcting for this effect which would require reconstructing the gap history and setting lineage specific neighbourhood systems.
For calculating the site-specific rates, we also treat gaps as missing data and calculate ⟨p_{ k }⟩ in eq. 10 only over non-gap positions. After the simulation, the whole column has on average ⟨p_{ k }⟩ estimated from the non-gap positions that does not change when originally gapped positions are masked again. For calculating the observed differences p between two sequences we set positions that includes gaps to the average ⟨p_{ k }⟩ at this site.
Distances above the level of saturation
When calculating genetic distances between two sequences the problem may occur that the observed number of differences is higher than the level of saturation. We found that this problem becomes severe when considering site-specific rates that generally lead to much lower levels of saturation (cf. Fig. 3B). We use a simple trick to overcome this limitation. We add additional sites during the simulation with site-specific rates that correspond to the average of the whole alignment (i.e. ⟨p_{ k }⟩ is set to 1-MPI in eq. 11 for all these additional sites). They act as "buffer sites" that reduces the number of mutation events that repeatedly hit the same sites of high rate leading to many double substitutions. As a consequence, the overall level of observed differences is higher and we do not run into problems building the distance matrix. In the end, the sites are removed again and since the relative rate ratios between the sites remained unchanged, we get the desired site-specific mutation patterns.
Limiting base composition variation
During the testing of the influence of the randomization procedure on RNA folding, we made an interesting observation. As expected, the variance of the folding energies of randomized data is higher with simulation than with shuffling. However, we also observed that there is difference in the mean. Simulation leads to slightly higher (i.e. less stable) folding energies than shuffling. We observed this behaviour not only on multiple alignments but also on single sequences using shuffling vs. first order Markov simulation. We suspect that extreme deviations in the base composition that can occur in simulated data do not symmetrically lead to the same deviations of the folding energies but preferentially impair the formation of RNA structures. To compensate for this effect, we have introduced an option in our software that only outputs simulated alignments, that are within a specific range of mononucleotide frequencies. We can thus limit our simulations to mononucleotide frequencies that are almost exactly as in the original data. As a distance measure we use the Euclidean distance $\sum}_{\alpha \in A,G,C,T}\sqrt{{({\pi}_{\alpha}^{s}-{\pi}_{\alpha})}^{2}$ with π_{ α }the desired frequency of nucleotide α in the original alignment, and ${\pi}_{\alpha}^{s}$ the observed frequency in the simulation. For all the data shown in Figs. 8, 9 and Tab. 1 we used simulations with this cutoff set to 0.05.
Software
For the simulations in Fig. 1 we used seq-gen version 1.3.2 [52, 53]. Monunucleotide shuffling was carried out using shuffle-aln.pl with option "--mode conservative2". Together with alifoldz.pl it is available online [54]. For the tests in Figs. 1 and 8 we used RNAalifold from the Vienna RNA package [55] version 1.6.1. with options "-nc 0 -cv 0") and RNAz [56] version 1.0 with standard parameters. For implementation of our software we used a series of third party C-code that is available as open source: levmar [57] by Manolis Lourakis for least squares fitting, BIONJ [41, 58] by Olivier Gascuel, PHYML [59, 60] by Stéphane Guindon and Olivier Gascuel for maximum likelihood estimation of the transition/transversion rate, Vienna RNA package [55] by Ivo L. Hofacker and others for consensus folding in SISSIz.
Sequence data
For the benchmark we used sequences from the following eight Rfam families: RF00001 (5S rRNA), RF00004 (U2 snRNA), RF00005 (tRNA), RF00008 (Hammerhead ribozyme), RF00012 (U3 snRNA), RF00020 (U5 snRNA), RF00029 (Group II intron), RF00104 (mir-10 precursor). From these sequences, a set of non-redundant alignments between 3 and 6 sequences per alignment and mean pairwise identity between appr. 50% and 100% was created as described [5, 6]. The families were chosen because they represent different structural families and contain enough sequences to create sets of reasonable sample size.
Genomic alignments were extracted from Multiz 17-way vertebrate alignments available at the UCSC genome browser [61, 62]. For creating the set of 1000 alignments used for Figs. 6 and 8, we used the rnazWindow.pl script from the RNAz software package [56, 63] to get typical alignment blocks as used previously in genomic ncRNA screens [20] or [14]. For the benchmark we selected for each positive example of the structural RNA set a negative example from the genomic alignments. Subsets of sequences were chosen to get the same number of sequences and the same mean pairwise identity (± 0.05) as the structural RNA counterpart. Also the alignment length was adjusted accordingly (limited to a maximum length of 150).
Declarations
Acknowledgements
We thank Arndt von Haeseler, Ivo Hofacker, and Nick Goldman for discussions and Roland Fleißner for comments on the manuscript. We acknowledge funding from the Austrian GEN-AU projects "noncoding RNA" and "Bioinformatics Integration Network" and Financial support to the CIBIV institute from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF).
Authors’ Affiliations
References
- Griffiths-Jones S: Annotating noncoding RNA genes. Annu Rev Genomics Hum Genet 2007, 8: 279–298. 10.1146/annurev.genom.8.080706.092419View ArticlePubMedGoogle Scholar
- Athanasius F Bompfünewerer Consortium, Backofen R, Bernhart SH, Flamm C, Fried C, Fritzsch G, Hackermüller J, Hertel J, Hofacker IL, Missal K, Mosig A, Prohaska SJ, Rose D, Stadler PF, Tanzer A, Washietl S, Will S: RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol 2007, 308: 1–25. 10.1002/jez.b.21130View ArticleGoogle Scholar
- Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8–8. 10.1186/1471-2105-2-8PubMed CentralView ArticlePubMedGoogle Scholar
- Coventry A, Kleitman DJ, Berger B: MSARi: multiple sequence alignments for statistical detection of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(33):12102–12107. 10.1073/pnas.0404193101PubMed CentralView ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL: Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol 2004, 342: 19–30. 10.1016/j.jmb.2004.07.018View ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2454–2459. 10.1073/pnas.0409169102PubMed CentralView ArticlePubMedGoogle Scholar
- Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2006., 2(4):
- Yao Z, Weinberg Z, Ruzzo WL: CMfinder-a covariance model based RNA motif finding algorithm. Bioinformatics 2006, 22(4):445–452. 10.1093/bioinformatics/btk008View ArticlePubMedGoogle Scholar
- Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics 2006, 7: 173. 10.1186/1471-2105-7-173PubMed CentralView ArticlePubMedGoogle Scholar
- Torarinsson E, Sawera M, Havgaard JH, Fredholm M, Gorodkin J: Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res 2006, 16(7):885–9. 10.1101/gr.5226606PubMed CentralView ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol 2005, 23(11):1383–1390. 10.1038/nbt1144View ArticlePubMedGoogle Scholar
- Missal K, Rose D, Stadler PF: Non-coding RNAs in Ciona intestinalis. Bioinformatics 2005, 21(Suppl 2):ii77–78. 10.1093/bioinformatics/bti1113View ArticlePubMedGoogle Scholar
- Missal K, Zhu X, Rose D, Deng W, Skogerbo G, Chen R, Stadler PF: Prediction of structured non-coding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. J Exp Zoolog B Mol Dev Evol 2006, 306(4):379–392. 10.1002/jez.b.21086View ArticleGoogle Scholar
- Rose D, Hackermueller J, Washietl S, Reiche K, Hertel J, Findeiss S, Stadler PF, Prohaska SJ: Computational RNomics of Drosophilids. BMC Genomics 2007, 8: 406. 10.1186/1471-2164-8-406PubMed CentralView ArticlePubMedGoogle Scholar
- Axmann IM, Kensche P, Vogel J, Kohl S, Herzel H, Hess WR: Identification of cyanobacterial non-coding RNAs by comparative genome analysis. Genome Biol 2005., 6(9):
- Weile C, Gardner PP, Hedegaard MM, Vinther J: Use of tiling array data and RNA secondary structure predictions to identify noncoding RNA genes. BMC Genomics 2007, 8: 244–244. 10.1186/1471-2164-8-244PubMed CentralView ArticlePubMedGoogle Scholar
- del Val C, Rivas E, Torres-Quesada O, Toro N, Jiménez-Zurdo JI: Identification of differentially expressed small non-coding RNAs in the legume endosymbiont Sinorhizobium meliloti by comparative genomics. Mol Microbiol 2007, 66(5):1080–1091. 10.1111/j.1365-2958.2007.05978.xPubMed CentralView ArticlePubMedGoogle Scholar
- Mourier T, Carret C, Kyes S, Christodoulou Z, Gardner PP, Jeffares DC, Pinches R, Barrell B, Berriman M, Griffiths-Jones S, Ivens A, Newbold C, Pain A: Genome-wide discovery and verification of novel structured RNAs in Plasmodium falciparum. Genome Res 2007.Google Scholar
- Sandmann T, Cohen SM: Identification of Novel Drosophila melanogaster MicroRNAs. PLoS ONE 2007., 2(11):Google Scholar
- Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF: Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 2007, 17(6):852–864. 10.1101/gr.5650707PubMed CentralView ArticlePubMedGoogle Scholar
- Babak T, Blencowe BJ, Hughes TR: Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics 2007, 8: 33. 10.1186/1471-2105-8-33PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2(6):526–538.PubMedGoogle Scholar
- Clote P, Ferré F, Kranakis E, Krizanc D: Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 2005, 11(5):578–591. 10.1261/rna.7220505PubMed CentralView ArticlePubMedGoogle Scholar
- Workman C, Krogh A: No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 1999, 27(24):4816–4822. 10.1093/nar/27.24.4816PubMed CentralView ArticlePubMedGoogle Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. J Mol Evol 1985, 22: 160–174. 10.1007/BF02101694View ArticlePubMedGoogle Scholar
- Hofacker IL, Fekete M, Stadler PF: Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319(5):1059–1066. 10.1016/S0022-2836(02)00308-XView ArticlePubMedGoogle Scholar
- Felsenstein J: Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2004.Google Scholar
- Tavaré S: Some probabilistic and statistical problems on the analysis of DNA sequences. Lec Math Life Sci 1986, 17: 57–86.Google Scholar
- Jensen J, Pedersen AM: Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Prob 2000, 32: 499–517. 10.1239/aap/1013540176View ArticleGoogle Scholar
- Duret L, Galtier N: The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact. Mol Biol Evol 2000, 17(11):1620–1625.View ArticlePubMedGoogle Scholar
- Pedersen AM, Jensen J: A dependent rates model and MCMC based methodology for the maximum likelihood analysis of sequences with overlapping reading frames. Mol Biol Evol 2001, 18: 763–776.View ArticlePubMedGoogle Scholar
- Arndt PF, Burge CB, Hwa T: DNA sequence evolution with neighbor-dependent mutation. J Comput Biol 2003, 10: 313–322. 10.1089/10665270360688039View ArticlePubMedGoogle Scholar
- Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL: Protein Evolution with Dependence Among Codons Due to Tertiary Structure. Mol Biol Evol 2003, 20: 1692–1704. 10.1093/molbev/msg184View ArticlePubMedGoogle Scholar
- Siepel A, Haussler D: Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood. Mol Biol Evol 2004, 21: 468–488. 10.1093/molbev/msh039View ArticlePubMedGoogle Scholar
- Lunter G, Hein J: A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics 2004, 20: i216-i223. 10.1093/bioinformatics/bth901View ArticlePubMedGoogle Scholar
- Christensen OF: Pseudo-likelihood for non-reversible nucleotide substitution models with neighbor dependent rates. Stat Appl Genet Mol Biol 2006, 5: 1–29.Google Scholar
- Gesell T, von Haeseler A: In silico sequence evolution with site-specific interactions along phylogenetic trees. Bioinformatics 2006, 22: 716–722. 10.1093/bioinformatics/bti812View ArticlePubMedGoogle Scholar
- Schöniger M, von Haeseler A: A Stochastic Model for the Evolution of Autocorrelated DNA sequences. Mol Phylogenet Evol 1994, 3: 240–247. 10.1006/mpev.1994.1026View ArticlePubMedGoogle Scholar
- Schöniger M, von Haeseler A: Simulating efficiently the evolution of DNA sequences. Comput Appl Biosci 1995, 11: 111–115.PubMedGoogle Scholar
- Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evo 1984, 20: 86–93. 10.1007/BF02101990View ArticleGoogle Scholar
- Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14(7):685–695.View ArticlePubMedGoogle Scholar
- Peer Y, Baldauf SL, Doolittle WF, Meyer A: An updated and comprehensive rRNA phylogeny of (crown) eukaryotes based on rate-calibrated evolutionary distances. J Mol Evol 2000, 51(6):565–576.View ArticlePubMedGoogle Scholar
- Thorne J, Kishino H, Felsenstein J: An Evolutionary Model for Maximum Likelihood Alignment of DNA Sequences. J Mol Evol 1991, 33: 114–124. 10.1007/BF02193625View ArticlePubMedGoogle Scholar
- Thorne J, Kishino H, Felsenstein J: Inching toward reality: An improved likelihood model of sequence evolution. J Mol Evol 1992, 34: 3–16. 10.1007/BF00163848View ArticlePubMedGoogle Scholar
- Metzler D: Statistical alignment based on fragment insertion and deletion models. Bioinformatics 2003, 19: 490–499. 10.1093/bioinformatics/btg026View ArticlePubMedGoogle Scholar
- Miklós I, Lunter G, Holmes I: A "Long Indel" Model For Evolutionary Sequence Alignment. Mol Biol Evol 2004, 21: 529–540. 10.1093/molbev/msh043View ArticlePubMedGoogle Scholar
- Fleißner R, Metzler D, von Haeseler A: Simultaneous Statistical Alignment and Phylogeny Reconstruction. Syst Biol 2005, 54: 548–561. 10.1080/10635150590950371View ArticlePubMedGoogle Scholar
- SISSIz[http://sourceforge.net/projects/sissiz]
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708–715. 10.1101/gr.1933104PubMed CentralView ArticlePubMedGoogle Scholar
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, (33 Database):D121–4.Google Scholar
- Forsdyke DR: Calculation of folding energies of single-stranded nucleic acid sequences: conceptual issues. J Theor Biol 2007, 248(4):745–753. 10.1016/j.jtbi.2007.07.008View ArticlePubMedGoogle Scholar
- Seq-Gen[http://tree.bio.ed.ac.uk/software/seqgen]
- Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 1997, 13(3):235–238.PubMedGoogle Scholar
- AlifoldZ/shuffle-aln[http://www.tbi.univie.ac.at/papers/SUPPLEMENTS/Alifoldz]
- The Vienna RNA package[http://www.tbi.univie.ac.at/~ivo/RNA]
- RNAz – predicting structural noncoding RNAs[http://www.tbi.univie.ac.at/~wash/RNAz]
- levmar: Levenberg-Marquardt nonlinear least squares algorithms in C/C++[http://www.ics.forth.gr/~lourakis/levmar]
- BIONJ[http://www.lirmm.fr/~w3ifa/MAAS/BIONJ/]
- PhyML[http://atgc.lirmm.fr/phyml]
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696–704. 10.1080/10635150390235520View ArticlePubMedGoogle Scholar
- UCSC genome browser[http://genome.ucsc.edu]
- Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 2007.Google Scholar
- Washietl S: Prediction of Structural Noncoding RNAs With RNAz. Methods Mol Biol 2007, 395: 503–526.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.