- Research article
- Open Access
TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
- Arif O Harmanci^{1, 4},
- Gaurav Sharma^{1, 3, 4}Email author and
- David H Mathews^{2, 3, 4}
https://doi.org/10.1186/1471-2105-12-108
© Harmanci et al; licensee BioMed Central Ltd. 2011
- Received: 3 September 2010
- Accepted: 20 April 2011
- Published: 20 April 2011
Abstract
Background
The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences. These methods usually suffer from high computational complexity. In this paper, TurboFold, a novel and efficient method for secondary structure prediction for multiple RNA sequences, is presented.
Results
TurboFold takes, as input, a set of homologous RNA sequences and outputs estimates of the base pairing probabilities for each sequence. The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment-based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. The extrinsic information is introduced as free energy modifications for base pairing in a partition function computation based on the nearest neighbor thermodynamic model. This process yields updated estimates of base pairing probability. The updated base pairing probabilities in turn are used to recompute extrinsic information, resulting in the overall iterative estimation procedure that defines TurboFold.
TurboFold is benchmarked on a number of ncRNA datasets and compared against alternative secondary structure prediction methods. The iterative procedure in TurboFold is shown to improve estimates of base pairing probability with each iteration, though only small gains are obtained beyond three iterations. Secondary structures composed of base pairs with estimated probabilities higher than a significance threshold are shown to be more accurate for TurboFold than for alternative methods that estimate base pairing probabilities. TurboFold-MEA, which uses base pairing probabilities from TurboFold in a maximum expected accuracy algorithm for secondary structure prediction, has accuracy comparable to the best performing secondary structure prediction methods. The computational and memory requirements for TurboFold are modest and, in terms of sequence length and number of sequences, scale much more favorably than joint alignment and folding algorithms.
Conclusions
TurboFold is an iterative probabilistic method for predicting secondary structures for multiple RNA sequences that efficiently and accurately combines the information from the comparative analysis between sequences with the thermodynamic folding model. Unlike most other multi-sequence structure prediction methods, TurboFold does not enforce strict commonality of structures and is therefore useful for predicting structures for homologous sequences that have diverged significantly. TurboFold can be downloaded as part of the RNAstructure package at http://rna.urmc.rochester.edu.
Keywords
- Positive Predictive Value
- Extrinsic Information
- Turbo Decode
- Synechococcus PCC6717
- Base Pairing Probability
Background
The discovery that RNA can directly regulate chemical reactions in a cell without being translated into, or coding for, a protein has radically altered the understanding of RNA function [1, 2]. Many types of such non-coding RNAs (ncRNAs) have been identified, with roles in diverse cellular activities [3, 4] and it is predicted that numerous ncRNAs are yet to be identified [4–8].
Correct determination of the secondary structure of a ncRNA, i.e., the canonical base pairing interactions between the nucleotides, is important for understanding the chemical basis for its function [9]. In addition, accurate prediction of RNA secondary structure also improves computational methods that scan genomes for novel ncRNA genes [4, 10–14] because these methods utilize structure prediction to test for conserved secondary structure across genomes, which, in turn suggests that the sequence regions corresponding to conserved structural regions form homologous ncRNA genes.
A number of alternative techniques have been proposed for RNA secondary structure prediction - a process that is commonly referred to as RNA folding [15, 16]. For folding a single RNA sequence, the state of the art method utilizes a thermodynamic model that predicts molecular stability for a given set of base pairing interactions using a nearest neighbor model [17–20]. When multiple RNA homologs that share a common secondary structure are available, significantly higher accuracy can be obtained by folding these multiple sequences together to find the conserved structure. In fact, comparative sequence analysis methods [21] that utilize a large number of homologs for RNA folding, currently offer the most accurate prediction of secondary structure. Comparative sequence analysis takes as input multiple homologous RNA sequences and predicts a consensus secondary structure. The analysis is an iterative process, where the sequences are aligned and conserved base pairs are identified between columns of the alignment. Then the pairing information is utilized to refine the alignment of the sequences in the next iteration. Comparative sequence analysis aims at combining the folding of individual sequences and the alignment between the sequences to determine the consensus structure. The method is, however, manual and time consuming. Computational methods for structure prediction using multiple homologous sequences can be thought of as attempts to automate comparative sequence analysis, typically with a much smaller number of input sequences. A recent comprehensive review of computational methods for structure prediction for multiple sequences can be found in [22].
The estimated posterior probabilities of base pairing output by TurboFold can be utilized for predicting the secondary structure of the sequences, either by thresholding the probabilities to obtain structures composed of base pairs with estimated pairing probabilities higher than a desired threshold or by using the estimated posterior probabilities in a maximum expected accuracy (MEA) secondary structure prediction algorithm [24–26]. The latter algorithm is termed TurboFold-MEA. While TurboFold predicts the secondary structures for the multiple sequences collectively using information from all sequences, it does not do so with a rigid definition of common secondary structure for the collection of sequences. Thus TurboFold permits variable folding domains that are seen in some of the sequences and not in others, a scenario that is not uncommon in ncRNA sequences that are homologous despite the minor variations in their secondary structure topology.
Benchmarking results demonstrate that the base pairing probability estimates of TurboFold are more accurate than alternative methods that provide such estimates, i.e. for a given sensitivity, the base pair predictions obtained by thresholding the estimated probabilities from TurboFold have a higher positive predictive value (PPV) than the alternative methods. Secondary structure prediction using TurboFold-MEA also provides among the highest accuracy across the secondary structure prediction methods benchmarked. Specifically, for ncRNA families with significant structural variation, TurboFold-MEA has a higher sensitivity than other methods at similar PPV. For other ncRNA families, the results of TurboFold-MEA are comparable to the best performing methods. The computation time and memory requirements of TurboFold are modest and comparable to, or lower than, those for other methods with comparable accuracy, with the exception of RAF [27], which is faster.
In the next section, TurboFold is presented as an iterative algorithm that alternates between computations of a) extrinsic information and b) a modified partition function that yields estimates of posterior base pairing probabilities. Within the section, a description is also provided for methods for prediction of secondary structures from base pairing probability estimates, either by composing structures made from base pairs with estimated probabilities higher than a chosen threshold or by using the MEA methodology. The Results section benchmarks the performance of TurboFold and TurboFold-MEA against other secondary structure prediction methods with regard to structure prediction accuracy and resource (time and memory) requirements. The Discussion section presents the motivation for the proposed method and the nomenclature by exploring connections with Turbo-decoding [28] and presents an example that illustrates TurboFold's ability to allow variable structural elements across input sequences. The relation of TurboFold to existing multi-sequence methods for prediction of RNA secondary structure is also addressed within the Discussion section.
Methods
TurboFold takes as input K RNA sequences denoted by x_{1}, x_{2}, ..., x_{ K } or where denotes the set of sequence indices. The length of the m^{th} sequence x_{ m } is denoted by N_{ m } . Thus the sequence x_{ m } consists of an sequence of N_{ m } nucleotides ordered from the 5' to the 3' end, where each nucleotide takes values from the alphabet set {A, U, G, C} based on its identifying nitrogenous base. A secondary structure S_{ m } on an RNA sequence x_{ m } is represented as the set {(i_{ l }, j_{ l } )} _{ l } of pairs (i_{ l }, j_{ l } ) of nucleotide indices i_{ l } , j_{ l } corresponding to the base pairs in the secondary structure, where the subscript l indexes the base pairs in the structure. By convention, 1 ≤ i < j ≤ N_{ m } and each nucleotide position can participate in at most one base pair. Furthermore, as is common, for computational reasons, it is assumed that the base pairs within a structure satisfy the pseudoknot free condition, i.e. for any four nucleotide indices 1 ≤ i_{1} < i_{2} < j_{1} < j_{2} ≤ N_{ m } , both (i_{1}, j_{1}) and (i_{2}, j_{2}) cannot be base pairs in S _{ m }.
The steps in TurboFold are listed in Algorithm 1. The ensuing description first provides a high-level overview which is followed by details of the individual modules within the algorithm. The notational convention denotes probabilities by π and matrices of probability entries by Π. Terms analogous to, but not strictly, probabilities are denoted as and , respectively, in their scalar and matrix forms. The association of these terms with a sequence or a pair of sequences is indicated by adding superscripts comprised of a single sequence index or a two-tuple of sequence indices. Pre-subscripts of p and c indicate that they are associated with pairing and co-incidence events, respectively. Finally, if required, a pre-superscript denotes the iteration index.
Prior to commencing the iterations, pairwise posterior co-incidence probability matrices _{ c } Π^{(s,m)}and pairwise sequence identities ψ_{m,s}are computed for each pair of sequences (m, s), m, , m ≠ s. Specifically, _{ c } Π^{(m,s)}is an N_{ m } × N_{ s } matrix whose ik^{th} entry _{ c }π^{(m,s)}(i, k) is the posterior probability that nucleotide at index i in x_{ m } is co-incident with the nucleotide at index k in x _{ s }. The sequence identity, ψ_{m,s}, is computed as the fraction of positions, along the maximum likelihood alignment path, in which the nucleotides for sequence x_{ m } and x_{ s } match. The posterior co-incidence probability matrices _{ c } Π^{(s,m)}and sequence identity ψ_{m,s}are computed efficiently in TurboFold using a pairwise alignment Hidden Markov Model (HMM) [23, 29], which requires O(N_{ m }N_{ s } ) operations and storage for each sequence pair.
Extrinsic Information Computation
performance penalty. This is indicated in (1) by constraining the indices k and l to constraint sets and , respectively, where denotes the set of indices for which the posterior co-incidence probabilities _{ c }π^{(m,s)}(i, k) exceed a chosen, sufficiently low, significance threshold, and is similarly defined. The computation of these sets of constrained co-incident indices is described in detail in [23]. If, over all choices of sequence pairs (m, s), the average number of elements in the set (and ) is d, then the computation of a term in one of the matrices requires (d^{2}) operations on average. It is worth noting that without the constraints for indices k and l, the evaluation of induced probabilities in (1) could be expressed as two matrix multiplications, , which would require operations per entry.
The use of co-incidence, rather than alignment, probabilities for the generation of extrinsic information is motivated by the fact that the coincidence probabilities, which are the sum of probabilities for matching, insertion and deletion events in the alignment, propagate pairing proclivities to inserted base pairs that change the lengths of helices, whereas alignment probabilities would restrict the extrinsic information to only the conserved base pairs.
The asymptotic time complexity of extrinsic information computation for all sequences is O(K^{2}d^{2}N^{2}), where N is the longest sequence length. The memory complexity is O(KN^{2}) for storage of the extrinsic information matrix for the set of K sequences.
Modified Partition Function for Updating Base Pairing Probabilities
The base pairing probability matrix is computed efficiently via a modification of the dynamic programming algorithm for partition function calculation [30, 31] that uses the nearest neighbor thermodynamic model. Specifically, the pseudo-free energy term in (5) represents an a priori probability for the base pair (i, j), which in the modified dynamic programming algorithm contributes an addition of the pseudo free energy when considering pairing between nucleotides (i, j). The computation of modified partition function for all sequences has O(KN^{3}) time complexity and O(KN^{2}) memory complexity, where N is the longest sequence length.
Structure Prediction Utilizing the Base Pairing Probabilities
composed of base pairs deemed significant. Any choice of P_{thresh} greater than 0.5 guarantees that is a valid secondary structure [31]. For P_{thresh} ≤ 0.5, may contain base pairs that form pseudoknots or may contain multiple base pairs for a nucleotide.
Time and Space Complexity
The time and space complexity of TurboFold can be described in terms of the operations required for the one time initialization and the operations required for the η computationally identical iterations. For the initialization, the estimation of posterior co-incidence probability matrices and the pairwise sequence identities for all sequence pairs requires O(K^{2}N^{2}) computations. In order to store the co-incidence probability matrices computed in the initialization, O(K^{2}dN) memory is required. Over the η iterations, for all the sequences, updates of the extrinsic information require O(ηK^{2}N^{2}d^{2}) computations and the modified partition function evaluations require O(ηKN^{3}) computations. The storage requirement for the iterations is O(KN^{2}). These requirements for TurboFold can be contrasted with Sankoff's algorithm, which requires O(N^{3}d^{ K } ) computations and O(N^{2}d^{ K } ) memory, when used with a banded constraint on the nucleotide alignments for reducing computation by "cutting corners" [32]. Thus, the time and memory requirements for Sankoff's algorithm increase exponentially with increasing number of input sequences, whereas the time requirement for TurboFold increases proportional to the square of the number of input sequences, and memory requirement increases linearly with the number of input sequences.
It should be noted that, in each iteration, the base pairing probability computations for each sequence are performed independently. Therefore the base pairing probabilities for all sequences can be computed in parallel using K processors. In the current implementation of TurboFold, the user can specify the number of threads that will be used to compute the base pairing probabilities in parallel. The POSIX threads library is utilized for implementation of parallel computations.
Measures for Accuracy of Predicted Structures
The structure prediction accuracy is evaluated in terms of sensitivity and positive predictive value (PPV) of the predictions. For a sequence x _{ m }, the sensitivity of the predicted structure is the ratio of number of correctly predicted base pairs to the number of base pairs in the known structure and the PPV is the ratio of the number of correctly predicted base pairs to the number of base pairs in the predicted structure. A base pair between nucleotides at i_{ m } and j_{ m } in the predicted structure is assumed to be correctly predicted if there is a base pair (i_{ m } , j_{ m } ) or (i_{ m } - 1, j_{ m } ) or (i_{ m } + 1, j_{ m } ) or (i_{ m } , j_{ m } -1) or (i_{ m } , j_{ m } + 1) in the known structure, which is consistent with prior methodology for accuracy assessment [18, 33, 34]. This scoring reflects the uncertainty in structure determination by comparative analysis and thermal fluctuations in structure.
Selection of Parameters
Results
Three sets of experiments are performed for comparing TurboFold with other programs: 1) Experiments for assessing accuracy of structures predicted from thresholding of base pairing probabilities as computed by TurboFold; 2) Experiments for assessing accuracy of structures predicted from TurboFold-MEA; 3) Experiments for comparing time and memory requirements of TurboFold with other programs. Datasets for benchmarking experiments are generated as follows: 200 RNase P sequences are randomly selected from the RNase P Database [37], then the sequences are split into sets of K sequences such that 2 ≤ K ≤ 10. The average sequence length is 336 nucleotides and the average pairwise identity, as determined from the alignments computed by ClustalW 2.0.11 [38], is 50%. The random selection and division into combinations of K sequences (for 2 ≤ K ≤ 10) is also performed with 200 tmRNA sequences [39, 40] (average length of 366 nucleotides and average pairwise identity of 45%), and 30 telomerase RNA sequences [41] (445 nucleotides and 54% pairwise identity), 400 SRP sequences from the SRPDB [42] (187 nucleotides and 42% pairwise identity), 400 tRNA sequences from the compilation of tRNA sequences by Sprinzl et al. [35] (77 nucleotides and 47% pairwise identity), and 400 5S rRNA sequences from the 5S Ribosomal RNA database [36] (119 nucleotides and 63% pairwise identity). This procedure yields 9 datasets for each family and 54 datasets in total. The datasets are available as Additional File 3
Performance Benchmarks for Estimated Base Pairing Probabilities
- 1.
LocARNA [43] is structural alignment algorithm for multiple sequences that utilizes pairwise structural alignment computations progressively for prediction of the structural alignment. Version 1.5.2a is utilized, with Vienna RNA Software Package version 1.8.4, in probabilistic mode to generate base matching probabilities with consistency transformation ('-probabilistic -consistency-transformation' option). Given K input sequences, the single sequence reliabilities as computed by LocARNA are utilized as estimates of base pairing probabilities.
- 2.
RNAalifold [44] is a structure prediction algorithm that takes a sequence alignment of the input sequences. The structures are predicted via maximization of a score that is based on free energy changes and covariation from the sequence alignment. RNAalifold also estimates the base pairing probabilities for sequences via computation of a partition function for the alignment. The version included in Vienna RNA Software Package version 1.8.4 is utilized with command line option '-p' for computation of base pairing probabilities with ClustalW 2.0.11 [38] for computation of input sequence alignment.
- 3.
Single sequence partition function computation [30, 31], which computes the base pairing probabilities of a given RNA sequence in the equilibrium ensemble of secondary structures. The partition function computation as implemented in RNAstructure version 4.5 [31, 45] are utilized in benchmark experiments.
For the RNase P, tmRNA, telomerase RNA, and SRP datasets, TurboFold has higher sensitivity for a fixed PPV, and higher PPV for a fixed sensitivity than the other methods. In addition, the PPV versus sensitivity plot for TurboFold approaches the top right corner, corresponding to ideal (sensitivity, PPV) = (1.0, 1.0), closer than the other three methods evaluated. The accuracy of TurboFold and LocARNA are comparable over tRNA datasets. Over 5S rRNA datasets, the accuracy of TurboFold is comparable to that of RNAalifold. The prediction accuracy of RNAalifold, however, depends significantly on the accuracy of the input alignment computed by ClustalW. Over datasets with high average pairwise identity, which are easier to align, predictions of RNAalifold are higher in accuracy than over datasets with lower average pairwise identity. Figure 7 illustrates this: Compared to other methods, the accuracy of RNAalifold predictions is highest for the 5S rRNA, whose average pairwise identity is significantly higher than average identities of other datasets. Additionally, the accuracy of RNAalifold for the K = 10 dataset is lower than for the K = 3 datasets when average sequence identity is low. TurboFold demonstrates a better performance with K = 10 than with K = 3 for all sequence families, as expected.
Structure Prediction Accuracy of TurboFold-MEA
- 1.
RAF [27] is a structural alignment algorithm that utilizes progressive pairwise alignments to predict the structural alignment. RAF utilizes a simple scoring scheme based on base pairing probabilities (as computed by CONTRAfold 2.02 [25]), alignment probabilities (as computed by CONTRAlign 2.01 [46]), and a set of weights learned from a dataset of multiple structural alignment dataset for structural alignment prediction. Version 1.0 is utilized with the default command line option for prediction ('-predict' option).
- 2.
LocARNA [43] Version 1.5.2a (with Vienna RNA Software Package version 1.8.4) is utilized.
- 3.
CentroidAlifold [47, 48] is a structural alignment method that takes an input sequence alignment and combines the base pairing information and input sequence alignment to predict structures for each sequence. The input sequence alignment is generated by ClustalW 2.0.11 [38].
- 4.
RNASampler [49] is an iterative sampling algorithm that predicts conserved helices in input sequences for structure prediction. RNA Sampler was used with default options.
- 5.
RNAcast [50] analyzes the folding space of input sequences in terms of abstract shapes and finds the optimal abstract shape that is common for all the structures and uses the optimal shape to generate consensus secondary structure. RNAcast is used with 40% free energy energy cut-off threshold, as in [34], because RNAcast fails to determine consensus structures for some datasets for higher thresholds.
- 6.
FOLDALIGNM [51] is a method for progressive structural alignment of RNA sequences. FOLDALIGNM version 1.0.1 is run with FOLDALIGN version 2.1.1 [52]. The java heap space is set to 10 gigabytes (with '-x 10000' option).
- 7.
MASTR [53] is a Markov chain Monte Carlo algorithm for structural alignment of a given set of RNA sequences. The default command line options are used for MASTR.
- 8.
MXScarna [54] is a method for structural alignment of multiple RNA sequences. MXScarna progressively aligns the sequences using an efficient pairwise structural alignment algorithm for determining the set of stems in the sequences that optimizes a scoring function evaluated from precomputed probabilities of base pairing and alignment. Version 2.1 is used in the predictions.
- 9.
CentroidHomfold [55] is a method that takes as input a target RNA sequence and (K - 1) sequences that are homologous to the target sequence and predicts a structure for the target sequence. For an input set of K sequences, predictions for each sequence are obtained by running CentroidHomfold K times with each of the sequences serving as the target sequence once with the remaining (K - 1) sequences as the homologous sequences. CentroidHomfold version 1.0 is used.
- 10.
Free energy minimization [19, 45] as implemented in RNAstructure version 4.5 is used for single sequence structure predictions.
Structure prediction accuracies of all the methods are evaluated over the 54 testing datasets. Some of the methods failed to complete on some of the datasets because of rather large memory requirements. These methods are therefore excluded from the reported results for the corresponding cases in the following description.
Comparison of Time and Memory Requirements
Computation time
Runtime (seconds) for | |||
---|---|---|---|
K = 3 | K = 5 | K = 10 | |
TurboFold-MEA | 136.75 | 277.9 | 517.0 |
RAF | 8.25 | 50.8 | 214.6 |
LocARNA | 746.44 | 2815.9 | 11395.8 |
CentroidAlifold | 2.0 | 3.7 | 6.8 |
RNAalifold | 0.2 | 0.3 | 0.6 |
MXScarna | 1.5 | 2.9 | 5.8 |
CentroidHomfold | 15.9 | 54.2 | 210.0 |
Memory usage
Memory Usage (Megabytes) for | |||
---|---|---|---|
K = 3 | K = 5 | K = 10 | |
TurboFold-MEA | 111.4 | 161.9 | 235.1 |
RAF | 184.1 | 381.1 | 518.2 |
LocARNA | 204.2 | 195.9 | 296.3 |
CentroidAlifold | 48.4 | 49.6 | 50.1 |
RNAalifold | 49.5 | 49.1 | 49.7 |
MXScarna | 47.0 | 46.9 | 47.1 |
CentroidHomfold | 52.6 | 55.6 | 51.2 |
Discussion
The computation of extrinsic information in TurboFold is similar to several previous approaches for combining homology information for multi-sequence alignment and structure prediction. For example, the method proposed in [55] approximates base pairing probabilities via a computation similar to the extrinsic information computation. TurboFold, however, is fundamentally different. Whereas the method in [55] is non-iterative and directly utilizes the approximated probabilities for structure prediction via a Nussinov style [56] dynamic programming algorithm, TurboFold iteratively updates the extrinsic information and recomputes probabilities of base pairing, alternating between these steps in order to refine the estimates of posterior base pairing probabilities. As shown in the Results Section, the iterative procedure offers a significant improvement over a single computation. Also, the consistency transformation [57] is utilized by LocARNA for re-estimating the alignment probabilities in the progressive alignment via a procedure similar to extrinsic information computation. This procedure, unlike the method in [55], updates only the probabilities of alignment and the structure predictions are not explicitly updated. LocARNA can, however, perform iterative refinement to update the predictions of structures and alignment. Another difference is that the other methods use posterior alignment probabilities whereas TurboFold uses posterior co-incidence probabilities. It was observed (data not shown) that the structure prediction accuracy of TurboFold decreases when posterior alignment probabilities are utilized for generating extrinsic information instead of posterior co-incidence probabilities. In addition to combining information from homologous sequences, the extrinsic information can be generated experimentally. For example, in [58], the ability to use chemical mapping data is integrated into single sequence free energy minimization where it contributes to the structure prediction as an experimentally derived extrinsic information and is utilized in a non-iterative manner.
The inverse similarity weighting (1 - ψ_{m,s}) in (2) is a good choice despite the fact that fact that, under this weighting, larger weights are assigned to highly diverged sequences can often not be aligned well. This is because, unlike methods that determine one alignment and incorporate it in jointly folding sequences, the alignment information in TurboFold is probabilistic and incorporated in the form of nucleotide co-incidence probabilities. For highly diverged sequences, these co-incidence probabilities are smaller in magnitude and diffused over a wider region. Though the inverse similarity weighting (1 - ψ_{m,s}) in (2) assigns larger weights to highly diverged sequences, they do not exercise a strong influence when the extrinsic information is computed by averaging across multiple sequences in (2). The experimental results for SRP sequences, whose predicted average pairwise identity is 42%, are in agreement with this observation. Compared with other methods TurboFold predictions provide the highest sensitivity.
The concept of iterative updates utilized in TurboFold is motivated by iterative error-correction coding methods in digital communications [61], especially Turbo decoding[28, 62]. For the case of two RNA homologs, based on the analogy with turbo decoding, the conceptual framework for iterative estimation of RNA secondary structures and alignments was previously introduced by the authors in [63], albeit without a practical realization and also with significant differences in details. Both TurboFold and Turbo decoding rely on multiple encodings of common underlying information, which the estimation (decoding) procedures seek to recover. In TurboFold a (largely) common secondary structure is "encoded" by nature in the form of multiple homologous sequences and the goal of the estimation is to recover this common secondary structure. In Turbo decoding, a common digital data stream is deliberately encoded by multiple, usually two, encoders prior to communication over a channel and the receiver seeks to recover the common digital data stream. Both problems benefit from iterative update procedures that are enabled by re-framing decoding or prediction in terms of estimating corresponding probabilities. Specifically, in TurboFold, the formulation of the RNA folding problem as a base pairing probability estimation problem, as opposed to the problem of estimating one or more consensus secondary structures, allows propagation of probabilistic information from one sequence to the other and iterative updates. It is also noteworthy that in TurboFold the extrinsic information is incorporated as a free energy modification in the partition function for estimating single sequence base pairing probabilities with minimal computational cost, which is analogous, in Turbo decoding, to the method for insertion of extrinsic information as a pseudo prior [62] in the decoding procedure for the recovery of a single encoded stream. There are also obvious differences between TurboFold and Turbo decoding. Whereas, in Turbo decoding, the encoding of the data is designed manually for explicitly enabling recovery at the receiver, there is no such explicit design for the multiple homologs that form the input to TurboFold. This apparent disadvantage is, however, offset in part by the fact that typically many more homologs are available for an ncRNA sequence for use in TurboFold whereas in Turbo decoding use of more than two encodings levies a cost in power and data rate that is usually not justified by relatively minor performance gains that these additional encodings enable.
The main limitation of TurboFold is its inability to predict sequence alignments that conform to the predicted secondary structures. In parallel with previously proposed iterative decoding of RNA structural alignment in [64], the most natural extension of TurboFold for prediction of sequence alignment is via an integration of a probabilistic model for alignment into the existing iterative structure prediction. A probabilistic model for alignment already exists in the hidden Markov model. The iterations, however, do not update the co-incidence probabilities of alignment. The integration of probabilistic alignment model into the iterative prediction is currently a future consideration.
Conclusion
TurboFold, a new method for secondary structure prediction for multiple homologous sequences, is presented in this paper. TurboFold estimates base pairing probabilities for each of the sequences via an iterative procedure that utilizes extrinsic information from other sequences and intrinsic information from a thermodynamic nearest neighbor model for RNA folding. Experimental results demonstrate that the iterative updates in TurboFold offer a significant improvement over both single sequence computations and over non-iterative multi-sequence computations of base pairing probabilities. The base pairing probability estimates from TurboFold outperform alternative multi-sequence methods for estimating base pairing probabilities. TurboFold can be downloaded, either as source code or precompiled binaries as part of the RNAstructure package for Microsoft Windows, at http://rna.urmc.rochester.edu.
TurboFold Algorithm
input : A set of K homologous RNA sequence , = {1,2, ..., K}.
output: Posterior base pairing probability estimates for each RNA sequence in the set.
begin
for m ← 1 to K do
for s ← 1 to K do
// Compute the alignment co-incidence probabilities and sequence identities via a hidden Markov pairwise sequence alignment model
Compute _{ c } Π^{(s,m)}and ψ_{ s,m };
end
end
// Iterate (η +1) times.
t ← 0;
while t ≤ η do
for m ← 1 to K do
// Compute extrinsic information for base pairing
if (t == 0) then
// Use uniform unity initialization for extrinsic information
;
else
Compute utilizing (details in Figures 2, 3, 4);
end
// Compute base pairing probabilities via modified partition function computation
Compute utilizing and nearest neighbor thermodynamic model;
end
t ← t + 1;
end
end
Algorithm 1: TurboFold: Iterative probabilistic structure prediction
Declarations
Acknowledgements
The authors thank the Center for Research Computing, University of Rochester, for making computation time available for performing the experiments and for technical assistance. This work was supported in part by NIH grant R01HG004002 to DHM. Authors were funded by the University of Rochester and by the National Institutes of Health to pay the open access publication charges for this article.
Authors’ Affiliations
References
- Stark BC, Kole R, Bowman EJ, Altman S: Ribonuclease P: An Enzyme with an Essential RNA Component. Proc Natl Acad Sci USA 1978, 75(8):3717. 10.1073/pnas.75.8.3717PubMed CentralView ArticlePubMedGoogle Scholar
- Cech TR, Zaug AJ, Grabowski PJ: In vitro splicing of the ribosomal RNA precursor of Tetrahymena: Involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell 1981, 27(3):487–496. 10.1016/0092-8674(81)90390-1View ArticlePubMedGoogle Scholar
- Eddy SR: Non-coding RNA Genes and the modern RNA World. Nat Rev 2001, 2(12):919–929. 10.1038/35103511View ArticleGoogle Scholar
- Griffith-Jones S: Annotating Noncoding RNA Genes. Annu Rev Genom Human Genet 2007, 8: 279–98. 10.1146/annurev.genom.8.080706.092419View ArticleGoogle Scholar
- Mattick JS, Makunin IV: Non-coding RNA. Hum Mol Genet 2006, 15: 17–29. 10.1093/hmg/ddl046View ArticleGoogle Scholar
- Szymanski M, Barciszewska MZ, Zywicki M, Barciszewski J: Noncoding RNA Transcripts. J Appl Genet 2003, 44: 1–19.PubMedGoogle Scholar
- Dinger ME, Pang KC, Mercer TR, Mattick JS: Differentiating Protein-Coding and Noncoding RNA: Challenges and Ambiguities. PLoS Comput Biol 2008., 4(11):Google Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816. 10.1038/nature05874View ArticlePubMedGoogle Scholar
- Doudna JA, Cech TR: The chemical repertoire of natural ribozymes. Nature 2002, 418(6894):222–228. 10.1038/418222aView ArticlePubMedGoogle Scholar
- Eddy SR: Computational Genomics of Noncoding RNA Genes. Cell 2002, 109(2):137–140. 10.1016/S0092-8674(02)00727-4View ArticlePubMedGoogle Scholar
- Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL: De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol 2010, 28: 9–19. 10.1016/j.tibtech.2009.09.006View ArticlePubMedGoogle Scholar
- Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8. 10.1186/1471-2105-2-8PubMed CentralView ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL, Stadler PF: Fast and Reliable Prediction of noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2454–2459. 10.1073/pnas.0409169102PubMed CentralView ArticlePubMedGoogle Scholar
- Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics 2006, 7: 173. 10.1186/1471-2105-7-173PubMed CentralView ArticlePubMedGoogle Scholar
- Turner DH, Sugimoto N, Freier SM: RNA structure prediction. Annual Review of Biophysics and Biophysical Chemistry 1988, 17: 167–192. 10.1146/annurev.bb.17.060188.001123View ArticlePubMedGoogle Scholar
- Mathews DH: Revolutions in RNA Secondary Structure Prediction. J Mol Biol 2006, 359(3):526–532. 10.1016/j.jmb.2006.01.067View ArticlePubMedGoogle Scholar
- Xia T, SantaLucia J Jr, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH: Thermodynamic Parameters for an Expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry 1998, 37(42):14719–14735. 10.1021/bi9809425View ArticlePubMedGoogle Scholar
- Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters provides improved prediction of RNA secondary structure. J Mol Biol 1999, 288(5):911–940. 10.1006/jmbi.1999.2700View ArticlePubMedGoogle Scholar
- Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(19):7287–7292. 10.1073/pnas.0401799101PubMed CentralView ArticlePubMedGoogle Scholar
- Mathews DH, Turner DH: Prediction of RNA secondary structure by free energy minimization. Curr Opin Struct Biol 2006, 16(3):270–278. 10.1016/j.sbi.2006.05.010View ArticlePubMedGoogle Scholar
- Pace NR, Thomas BC, Woese CR: Probing RNA structure, function and history by comparative analysis. In The RNA World. second edition. Cold Spring Harbor Laboratory Press; 1999:113–141.Google Scholar
- Bernhart SH, Hofacker IL: From consensus structure prediction to RNA gene finding. Brief Funct Genomic 2009, 8(6):461–471. 10.1093/bfgp/elp043View ArticleGoogle Scholar
- Harmanci AO, Sharma G, Mathews DH: Efficient Pairwise RNA Structure Prediction Using Probabilistic Alignment Constraints in Dynalign. BMC Bioinformatics 2007, 8: 130. 10.1186/1471-2105-8-130PubMed CentralView ArticlePubMedGoogle Scholar
- Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research 2003, 31(13):3423–3428. 10.1093/nar/gkg614PubMed CentralView ArticlePubMedGoogle Scholar
- Do CB, Woods DA, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22(14):90–98. 10.1093/bioinformatics/btl246View ArticleGoogle Scholar
- Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15(10):1805–1813. 10.1261/rna.1643609PubMed CentralView ArticlePubMedGoogle Scholar
- Do CB, Foo CS, Batzoglou S: A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics 2008, 24(13):68–76. 10.1093/bioinformatics/btn177View ArticleGoogle Scholar
- Berrou C, Glavieux A, Thitimajshima P: Near Shannon limit error-correcting coding and decoding: Turbo-codes. Proc IEEE Intl Conf Communications, Geneva, Switzerland 1993, 2: 1064–1070.Google Scholar
- Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1999.Google Scholar
- McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 1990, 29(6–7):1105–1119. 10.1002/bip.360290621View ArticlePubMedGoogle Scholar
- Mathews DH: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 2004, 10(8):1178–1190. 10.1261/rna.7650904PubMed CentralView ArticlePubMedGoogle Scholar
- Sankoff D: Simultaneous Solution of RNA Folding, Alignment and Protosequence Problems. SIAM J App Math 1985, 45(5):810–825. 10.1137/0145048View ArticleGoogle Scholar
- Harmanci AO, Sharma G, Mathews DH: PARTS: Probabilistic Alignment for RNA joinT Secondary Structure Prediction. Nucleic Acids Res 2008, 36(7):2406–2417. 10.1093/nar/gkn043PubMed CentralView ArticlePubMedGoogle Scholar
- Harmanci AO, Sharma G, Mathews DH: Stochastic Sampling of the RNA Structural Alignment Space. Nucleic Acids Research 2009, 37(12):4063–4075. 10.1093/nar/gkp276PubMed CentralView ArticlePubMedGoogle Scholar
- Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res 1998, 26: 148–153. 10.1093/nar/26.1.148PubMed CentralView ArticlePubMedGoogle Scholar
- Szymanski M, Barciszewska MZ, Barciszewski J, Erdmann VA: 5S ribosomal RNA database Y2K. Nucleic Acids Res 2000, 28: 166–167. 10.1093/nar/28.1.166PubMed CentralView ArticlePubMedGoogle Scholar
- Brown JW: The Ribonuclease P Database. Nucleic Acids Res 1999, 27: 314. 10.1093/nar/27.1.314PubMed CentralView ArticlePubMedGoogle Scholar
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics 2007, 23(21):2947–2948. 10.1093/bioinformatics/btm404View ArticlePubMedGoogle Scholar
- Zwieb C, Wower I, Wower J: Comparative sequence analysis of tmRNA. Nucleic acids research 1999, 27(10):2063–2071. 10.1093/nar/27.10.2063PubMed CentralView ArticlePubMedGoogle Scholar
- Zwieb C, Wower J: tmRDB (tmRNA database). Nucleic acids research 2000, 28: 169. 10.1093/nar/28.1.169PubMed CentralView ArticlePubMedGoogle Scholar
- Chen JL, Blasco MA, Greider CW: Secondary structure of vertebrate telomerase RNA. Cell 2000, 100(5):503–514. 10.1016/S0092-8674(00)80687-XView ArticlePubMedGoogle Scholar
- Andersen ES, Rosenblad MA, Larsen N, Westergaard JC, Burks J, Wower IK, Wower J, Gorodkin J, Samuelsson T, Zwieb C: The tmRDB and SRPDB Resources. Nucleic Acids Res 2006, 34: 163–168. 10.1093/nar/gkj142View ArticleGoogle Scholar
- Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering. PLoS Comput Biol 2007, 3(4):680–691.View ArticleGoogle Scholar
- Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF: RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 2008, 9: 474. 10.1186/1471-2105-9-474PubMed CentralView ArticlePubMedGoogle Scholar
- Reuter JS, Mathews DH: RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 2010, 11: 129. 10.1186/1471-2105-11-129PubMed CentralView ArticlePubMedGoogle Scholar
- Do C, Gross S, Batzoglou S: CONTRAlign: discriminative training for protein sequence alignment. In Research in Computational Molecular Biology. Springer; 2006:160–174.View ArticleGoogle Scholar
- Hamada M, Kiryu H, Sato K, Mituyama T, Asai K: Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 2009, 25(4):465. 10.1093/bioinformatics/btn601View ArticlePubMedGoogle Scholar
- Hamada M, Sato K, Asai K: Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res 2011, 39(2):393–402. 10.1093/nar/gkq792PubMed CentralView ArticlePubMedGoogle Scholar
- Xu X, Ji Y, Stormo GD: RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics 2007, 23(15):1883–1891. 10.1093/bioinformatics/btm272View ArticlePubMedGoogle Scholar
- Reeder J, Giegerich R: Consensus Shapes: An Alternative to the Sankoff Algorithm for RNA Consensus Structure Prediction. Bioinformatics 2005, 21(17):3516–3523. 10.1093/bioinformatics/bti577View ArticlePubMedGoogle Scholar
- Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences. Bioinformatics 2007, 23(8):926–932. 10.1093/bioinformatics/btm049View ArticlePubMedGoogle Scholar
- Havgaard JH, Torarinsson E, Gorodkin J: Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol 2007, 3(10):1896–1908.View ArticlePubMedGoogle Scholar
- Lindgreen S, Gardner PP, Krogh A: MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics 2007, 15(23):3304–11.View ArticleGoogle Scholar
- Tabei Y, Kiryu H, Kin T, Asai K: A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics 2008, 9: 33. 10.1186/1471-2105-9-33PubMed CentralView ArticlePubMedGoogle Scholar
- Hamada M, Sato K, Kiryu H, Mituyama T, Asai K: Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics 2009, 25(12):330–338. 10.1093/bioinformatics/btp228View ArticleGoogle Scholar
- Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for Loop Matching. Journal of Applied Mathematics 1978, 35: 68–82.Google Scholar
- Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 2005, 15: 330–340. 10.1101/gr.2821705PubMed CentralView ArticlePubMedGoogle Scholar
- Deigan KE, Li TW, Mathews DH, Weeks KM: Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci USA 2009, 106: 97–102. 10.1073/pnas.0806929106PubMed CentralView ArticlePubMedGoogle Scholar
- Steffen P, Voß B, Rehmsmeier M, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500–503. 10.1093/bioinformatics/btk010View ArticlePubMedGoogle Scholar
- Brown JW, Birmingham A, Griffiths PE, Jossinet F, Kachouri-Lafond R, Knight R, Lang BF, Leontis N, Steger G, Stombaugh J, et al.: The RNA structure alignment ontology. RNA 2009, 15(9):1623–1631. 10.1261/rna.1601409PubMed CentralView ArticlePubMedGoogle Scholar
- Hagenauer J: Iterative decoding of binary block and convolutional codes. IEEE Trans on Inform Theory 1996, IT-42(2):429–445.View ArticleGoogle Scholar
- Regalia PA: Iterative Decoding of Concatenated Codes: A Tutorial. EURASIP Journ Appl Sig Proc 2005, 6: 762–774.View ArticleGoogle Scholar
- Harmanci AO, Sharma G, Mathews DH: Toward Turbo Decoding of RNA Secondary Structure. Proc IEEE Intl Conf Acoustics Speech and Sig Proc 2007, I: 365–368.Google Scholar
- Harmanci AO, Sharma G, Mathews DH: Probabilistic Structural Alignment of RNA Sequences. Proc IEEE Intl Conf Acoustics Speech and Sig Proc, Las Vegas, Nevada 2008, 645–648.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.