A probabilistic model for the evolution of RNA structure
- Ian Holmes^{1, 2}Email author
Affiliated with
DOI: 10.1186/1471-2105-5-166
© Holmes. 2004
Received: 27 July 2004
Accepted: 26 October 2004
Published: 26 October 2004
Advertisement
DOI: 10.1186/1471-2105-5-166
© Holmes. 2004
Received: 27 July 2004
Accepted: 26 October 2004
Published: 26 October 2004
For the purposes of finding and aligning noncoding RNA gene- and cis-regulatory elements in multiple-genome datasets, it is useful to be able to derive multi-sequence stochastic grammars (and hence multiple alignment algorithms) systematically, starting from hypotheses about the various kinds of random mutation event and their rates.
Here, we consider a highly simplified evolutionary model for RNA, called "The TKF91 Structure Tree" (following Thorne, Kishino and Felsenstein's 1991 model of sequence evolution with indels), which we have implemented for pairwise alignment as proof of principle for such an approach. The model, its strengths and its weaknesses are discussed with reference to four examples of functional ncRNA sequences: a riboswitch (guanine), a zipcode (nanos), a splicing factor (U4) and a ribozyme (RNase P). As shown by our visualisations of posterior probability matrices, the selected examples illustrate three different signatures of natural selection that are highly characteristic of ncRNA: (i) co-ordinated basepair substitutions, (ii) co-ordinated basepair indels and (iii) whole-stem indels.
Although all three types of mutation "event" are built into our model, events of type (i) and (ii) are found to be better modeled than events of type (iii). Nevertheless, we hypothesise from the model's performance on pairwise alignments that it would form an adequate basis for a prototype multiple alignment and genefinding tool.
One of the promises of comparative genomics is to annotate previously undetectable functional signals in genomic sequence, by identifying and characterising evolutionarily conserved elements. A principled way to extract such signals is by fitting the data to probabilistic models of the molecular evolutionary process. The logic runs as follows: suppose there are various kinds of conserved element x, y, z... (e.g. exons, bits of RNA, promoters, etc) that might explain an observed sequence homology. For each of these scenarios, we can construct a probabilistic model M _{ x }, M _{ y }, M _{ z }... and compare the likelihood of the observed data under each of these models. The model with the best fit indicates the type of functional element present in the sequence.
A groundbreaking example of how this probabilistic approach can be used is the QRNA program, designed as a comparative RNA gene predictor [1]. The three types of element considered by QRNA are noncoding RNA (called RNA), protein-coding exons (called COD for codon), and unidentified DNA homology (called OTH for other). The former (RNA) was modeled using a Pairwise Stochastic Context-Free Grammar (Pair SCFG); the latter two (COD and OTH) using Pairwise Hidden Markov Models (Pair HMMs). The noncoding RNA predictions generated a high yield of experimental hits, and offered an information-theoretic glimpse into a modern-day RNA world [2].
It is natural to consider how such an approach might be applied to a pairwise comparison where the evolutionary "distance" between the two sequences can vary. One approach, analogous to the BLOSUM series of BLAST matrices for proteins [3], is to partition a set of training alignments into an ad hoc number of bins based on the percentage sequence identity. Alignments in the same bin (i.e. having comparable sequence identity) then represent pairs of sequences at approximately equivalent distances. For example, the BLOSUM62 substitution matrix was estimated from pairwise alignments with at least 62% identity. This sort of approach is used by the RIBOSUM basepair substitution matrices developed for RSEARCH [4], recent versions of QRNA, and the stemloc program in the author's DART software package.
An alternative approach, analogous to the PAM series of BLAST matrices [5], is to treat the "distance" as a time measurement, by postulating an underlying evolutionary stochastic process or continuous-time Markov chain whose mutation rate parameters are constant over time (called stationarity in stochastic process theory). This evolutionary rate approach uses fewer parameters – and makes fuller use of the data – than the dividing-into-bins approach, since it postulates an infinitesimal generator for all time-scales of the process. For the PAM series, this generator takes the form of an instantaneous substitution rate matrix; for a primary-sequence model, the generator is a conditionally-normalized Pair HMM or transducer [6]; for an RNA secondary-structure model, we will see that the generator is a Pair SCFG; and so on. Furthermore, the evolutionary rate model is supremely compatible with likelihood-based phylogenetic methods [7]. It's therefore worth considering such evolutionary rate-based models, although (since they're trickier to analyse mathematically) they're less suited to quick software prototyping than the "bin-by-percent-ID" approach.
A computer program for simultaneous pairwise alignment and secondary structure prediction using the TKF91 Structure Tree has been developed in C++. The potential of the model for RNA sequence alignment has been demonstrated by testing the pairwise aligner on four functional elements from the RFAM database [14]: the purine riboswitch, the nanos translational control element, the U2 splicing factor and the bacterial nuclear RNase P gene. The TKF91 Structure Tree is a very simple evolutionary model lacking some "obvious" features, such as natural selection to favour the thermodynamically stable overlap of π-orbitals between adjacent stacked bases in RNA double helices. The fact that the model appears to work reasonably well, despite the exclusion of such features, suggests that very simple models of RNA evolution may turn out to be sufficient to uncover a surprisingly large proportion of RNA sequence homology.
We begin by reviewing the TKF91 model [10]. This model describes the evolution of a single sequence under the action of two kinds of mutation event: (i) point substitution events, which act on a single residue only; and (ii) single-residue indel events, which insert or delete a single residue. The rates of both types of event are independent of the neighboring sequence.
The TKF91 model, as defined by Thorne et al, is time-reversible. This has the implication, called the pulley principle by Felsenstein, that the position of an ancestral node in a phylogenetic tree can be slid around like a pulley without changing the likelihood of the observed data [7]. Aligning a pair of observed present-day sequences is therefore identical to aligning an ancestral sequence with its descendant, and we can talk about ancestor-descendant alignment without loss of generality.
The TKF91 model can be analysed algebraically [10], and the probability distribution function (PDF) over ancestor-descendant alignments can be expressed as a Pair HMM [13] and extended to multiple sequences (using a "Multiple HMM") [13]. While it is straightforward to define a more general "long indel" model allowing multi-residue deletions and insertions [8], the only Pair HMMs for this general model that have been described to date are approximations, inspired by the form of the TKF91 model: so far there is no exact Pair HMM solution of the long indel model [8, 9, 15, 16]. In this paper, we will not be considering such long-indel models.
The state of the TKF91 process is described by a TKF91 link sequence: a permanent immortal link at the left end of the sequence, followed by zero or more mortal links. Over time, mortal links can be deleted, and new mortal links can be inserted to the right of either immortal or mortal links. This can be treated as a birth-death process (λ _{0}, μ _{0}) with constant immigration (λ _{0}), where "births" are identified with single-link insertions occuring to the immediate right of the parent mortal link. and "immigration" with insertions immediately right of the immortal link.
A further site-independent labeling is introduced on mortal links using the singlet nucleotide alphabet, Ω = {A, C, G, U}. Each site's alphabet label evolves as an independent four-state reversible continuous-time Markov chain (RCTMC) with substitution rate R _{0}(i, j) from state i to state j. Labels for newly inserted mortal links are drawn from the equilibrium distribution p _{0}(i) of this substitution process. By reading off the labels of mortal links, the state of the TKF91 process can be equated to a sequence in Ω*.
The following functions of (λ _{ n }, μ _{ n }) arise in analyses of equilibrium and transition probabilities in the TKF91 model [10]. Here t is a time parameter.
Here exp(R _{ n } t) ≡ exp(A) is the exponential of the matrix with elements A _{ ij }= R _{ n }(i, j)t.
The meaning of the above functions is as follows. α _{ n }is the probability of non-deletion; β _{ n }, γ _{ n }are the probabilities of insertion, following (respectively) an insertion and a deletion; κ _{ n }is the probability of continuing the ancestral sequence; and M _{ n }(i, j) is the conditional substitution probability from i to j. Note (1 - γ _{ n })κ _{ n }(1 - α _{ n }) = β _{ n }(delete → delete and insert → insert transition probabilities are equal). Note also lim_{ t→∞} β _{ n }= κ _{ n }.
The equilibrium probability distribution over sequences in the TKF91 model is a geometric distribution with parameter κ _{0}. The residues at individual positions of the sequence are independently, identically distributed at equilibrium and are sampled from the equilibrum distribution of the point substitution process.
Various extensions to TKF91 have been proposed [8, 9, 15]. The most tractable kind of extension changes the meaning of a "link" but leaves the indel process on links intact [15]. Our RNA model is one such extension, allowing two different kinds of TKF91 model that can be mutually nested to form loop and stem regions.
Consider the following extension to the TKF91 model, which we call the TKF91 Structure Tree, and which is shown in Figures 1 and 2. This model uses the fact that an RNA secondary structure (excluding pseudoknots, kissing loops and other "tertiary" interactions) can be identified with a tree. The state of our stochastic process can thus be described by a rooted tree: every node in this tree is either a singlet, paired, loop or stem node. The tree can be broken into overlapping loop sequences and stem sequences, which correspond to strands of unpaired RNA (loops) or double helices of basepaired RNA (stems). Loops are allowed to contain unpaired nucleotides, and can also serve as a branching-off point for nested stems. Stems, on the other hand, are allowed to contain paired nucleotides, and are terminated by a loop (this reflects the smallest unit of RNA structure, which is a stem terminated by a loop). The tree is rooted by a loop sequence. The above description will now be made more precise.
The state of the TKF91 Structure Tree is described by a rooted tree where each node has degree ≤ 3.
There are four basic kinds of node in the tree: singlet, paired, loop and stem.
Singlet and paired nodes correspond to observable nucleotides. Singlet nodes (labeled from Ω) represent independently evolving nucleotides, as in TKF91. Paired nodes (labeled from Ω^{2}) represent covariant basepairs.
Loop and stem nodes determine the tree structure (Figure 1). Loop nodes (labeled L), of which the root node is one, are present at the beginning of loop sequences, which contain singlet and stem nodes and are written horizontally. Stem nodes (labeled S) are present at the beginning of stem sequences, which contain paired nodes, are terminated by a loop node, and are written vertically.
The set of loop and stem node labels is written Φ. The full set of node labels is Ω ∪ Ω^{2} ∪ Φ.
Φ = {L, S}
Ω = {A, C, G, U}
Ω^{2} = {AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG, UU}
A loop sequence is very similar to a TKF91 link sequence: as with TKF91, we have a leftmost immortal loop link followed by zero or more mortal loop links. The mortal links are inserted and deleted with rates λ _{1} and μ _{1}, in the style of TKF91. Each link is also a node in the Structure Tree.
Links are labeled from Ω ∪ Φ: the immortal loop link is labeled L, while the mortal loop links are labeled from {A, C, G, U, S}. As with the TKF91 model, the alphabet labeling of each mortal link evolves as an independent five-state RCTMC with substitution rate R _{1}(i, j) from i to j and equilibrium probability p _{1}(i) of being in state i, plus the additional restriction that R _{1}(X, S) = R _{1}(S, X) = 0 for all X ∈ Ω: in other words, embedded stems can't interconvert with singlet nucleotides. See step 1 → 2 of Figure 2 for examples of single-nucleotide substitution in loop sequences, and steps 3 → 4 and 4 → 5 for single-nucleotide insertion and deletion.
The S-labeled links possess an independently evolving embedded stem sequence that can be considered to "nest" inside the loop sequence. If the S-link is deleted, then the embedded stem (and all its children) is deleted with it. Conversely, when a new S-link is inserted, it is inserted with a complete subtree that is sampled from the equilibrium distribution over Structure Trees. See steps 6 → 7 and 7 → 8 of Figure 2 for examples of substructure insertion and deletion.
Since a loop sequence is effectively a TKF91 sequence with a special "fifth nucleotide" character representing an embedded stem (the S link), it obeys the same statistics as a TKF91 sequence. In particular, the probability distribution over loop lengths at equilibrium is a geometric distribution with parameter κ _{1}.
A stem sequence is also derived from a TKF91 link sequence. Unlike the TKF91 link sequence or the loop sequence, however, it is written vertically (rather than horizontally) in Figure 1. It consists of a topmost immortal stem link, zero or more mortal stem links, and a bottommost, terminating loop link. Again, each link is also a node in the Structure Tree.
Links are labeled from Ω^{2} ∪ Φ: the immortal stem link is labeled S (this is the node in the parent loop sequence), the mortal links are labeled with the paired nucleotide alphabet Ω^{2} (each with an independent sixteen-state RCTMC modeling covariant pair substitution along RNA stems, with substitution rate matrix R _{2}(i, j) and equilibrium p _{2}(i)), and the terminating loop link is labeled L. The mortal stem links experience TKF91-style insertion and deletion with rates λ _{2} and μ _{2} (although, in the diagrammatic form of Figure 1, newly inserted links are placed immediately under their parent link, rather than immediately to the right). The terminating loop link L does not contribute to insertion or deletion (so is effectively immortal but inert) but possesses an independently evolving loop sequence. See step 2 → 3 of Figure 2 for examples of covariant basepair substitution in stem sequences, and step 5 → 6 for covariant basepair insertion and deletion.
Note that the immortal stem link, S, is only immortal from the point of view of the stem sequence beneath it. The S is itself a mortal link in a parent loop sequence, and may be deleted as that sequence evolves. In this event, the loop link L will also be deleted, along with all its children (step 7 → 8, Figure 2). Thus, the only truly immortal link is the loop node at the root of the Structure Tree, which has no parents to deal death from above.
As with the loop sequence, a stem sequence is effectively a TKF91 sequence with minor modifications, and it obeys the same statistics as a TKF91 sequence. The probability distribution over stem lengths at equilibrium is a geometric distribution with parameter κ _{2}.
Φ_{1234} = {L _{1}, L _{2}, L _{3}, L _{4}, S _{1}, S _{2}, S _{3}, S _{4}}
Ω_{ a }= {A _{ a }, C _{ a }, G _{ a }, U _{ a }}
Ω_{ d }= {A _{ d }, C _{ d }, G _{ d }, U _{ d }}
Dynamic programming alignment of sequences to these grammars has the typical complexity for single-sequence [17] and pairwise [18] SCFGs. That is, for Figure 5, the time complexity is O(L ^{3}) and the memory complexity O(L ^{2}), while for Figure 6, the time complexity is O(L ^{3} M ^{3}) and the memory complexity O(L ^{2} M ^{2}), where L and M are sequence lengths. This is also the complexity of the single-sequence and two-sequence Sankoff algorithm [19], for which SCFGs may be regarded as a probabilistic scoring scheme. The time and memory complexity may be reduced by the use of "banding" techniques [20, 21], that restrict the dynamic programming computation to the (typically) highest-scoring central diagonal band of the dynamic programming matrix, or by more flexible constraints on the DP iteration [18].
We now describe some transformations of Figures 5,6 performed before implementing the grammar parsers.
The presence in a grammar of "null cycles" – sequences of production rules which cause no net change – complicates the parsing algorithms for that grammar. Generally, null cycles are avoided by programmers designing SCFGs or HMMs for sequence analysis [17]. However, in the grammars derived automatically for the TKF91 Structure Tree, null cycles arise naturally due to the possibility of zero-length loop or stem sequences in the model.
Classes of degeneracy in the Structure Tree grammars. Permutations and combinations of these cycles are also possible.
Degeneracy | Figure | Nonterminal sequence | Rule sequence | Comment |
---|---|---|---|---|
Null cycle | 5 | L → S → L | 2,3,5 | |
6 | L _{1} → ... → L _{1} | 4,7,11 | via S _{1} L _{1} | |
5,32,30 | via S _{4} L _{1} | |||
L _{2} → ... → L _{2} | 17,27,25 | |||
L _{1} → L _{2} ... | 6,27,25 | |||
...L _{2} → L _{1} | 15,11,7 | via S _{1} L _{1} | ||
16,32,30 | via S _{4} L _{1} | |||
L _{3} → S _{3} → L _{3} | 24,25,27 | |||
L _{4} → S _{4} → L _{4} | 29,30,32 | |||
Loop bifurcation | 5 | L → LL | 2,5 | |
6 | L _{1} → L _{1} L _{1} | 4,11 | ||
L _{2} → L _{1} L _{1} | 15,11 | |||
L _{3} → L _{3} L _{3} | 24,27 | |||
L _{4} → L _{4} L _{4} | 29,32 | |||
Silent bulge | 5 | S → L → S | 5,3,2 | Cyclic permutation of L → S → L |
6 | S _{2} → L _{1} → S _{1} | 22,4,7 | ||
S _{1} → L _{1} → S _{1} | 11,4,7 | Cyclic permutation of L _{1} → S _{1} → L _{1} | ||
S _{3} → L _{3} → S _{3} | 27,24,25 | Cyclic permutation of L _{3} → S _{3} → L _{3} | ||
S _{4} → L _{4} → S _{4} | 32,29,30 | Cyclic permutation of L _{4} → S _{4} → L _{4} |
As well as null cycles, there are other undesirable degeneracies in the Structure Tree grammars. Grammatical degeneracy occurs when more than one parse has the same meaning, so parses are degenerate rather than unique. Most stochastic grammars useful for bioinformatics are degenerate in the sense that there are always many folds or alignments consistent with the observed sequence data; this sort of degeneracy is technically called ambiguity. We are more concerned with other forms of degeneracy, such as structural degeneracy (multiple parses denote a single pattern of basepairing) and alignment degeneracy (multiple parses denote a single alignment).
TKF91, in effect, skirts alignment degeneracy by assigning meaning to the ordering of deletions and insertions in an alignment, but alignment degeneracies arise in the Structure Tree model because there are multiple ways to delete and insert things (e.g. deleting a whole stem, versus deleting all its elements individually). There are also structural degeneracies arising from "silent" (i.e. non-emitting) loops or stems. In addition to the null cycles described above, these include (for the singlet grammar) the undesirable "loop bifurcation" L → LL and the "silent bulge" S → S (a null cycle). A full list of degeneracies for the singlet and pair grammars is shown in Table 1.
The null cycles all involve zero-length stems and can be broken (NB not marginalised; the likelihood is discarded) by adding extra nonterminals before the corresponding S _{ k }, copying all outgoing rules except the nonemitting S _{ k }→ L _{ k }. This also removes the loop bifurcations, but leaves silent bulges of the form S _{ k }→ . The silent bulges can be removed by adding nonterminals before L _{ k }, copying all outgoing rules except L _{ k }→ ε, changing to so as to prevent escape from without an unpaired emission, and adding new rules of the form to allow escape if there is a genuine bifurcation.
A more careful analysis, marginalising null cycles and silent bulges rather than simply ignoring them, is almost certainly possible.
The asymptotic complexity of the dynamic programming recursions implied by these grammars is unchanged by the transformation to DART form. For Figure 7, the time complexity is O(L ^{3}) and the memory complexity O(L ^{2}), while for Figure 8, the time complexity is O(L ^{3} M ^{3}) and the memory complexity O(L ^{2} M ^{2}), where L and M are sequence lengths. Again, the complexity may be reduced by the use of "banding" [20, 21] or other [18] constraints.
The Expectation Maximisation (EM) algorithm is often used for training BLOSUM-like models, e.g. estimating emission and transition probabilities for Pair HMMs [17] or Pair SCFGs [1]. It is also useful for training evolutionary rate models, which have roughly the same number of parameters and can make use of larger training sets (since the training data don't have to be "binned" by percent identity).
The EM algorithm for the TKF91 Structure Tree can be separated into two parts, one for the substitution process and one for the indel process. Earlier work [22] showed how to estimate the maximum-likelihood substitution rate matrix R _{ n }using the EM algorithm, given the following sufficient statistics:
, the expected number of insertions of state d;
, the expected number of aligned sites with ancestral state a and descendant state d.
A forthcoming paper describes how to estimate the maximum-likelihood indel rates λ _{ n }, μ _{ n }for a TKF91 model using the EM algorithm, given the following sufficient statistics:
, the expected number of deleted links not followed by an insertion;
, the expected number of surviving links not followed by an insertion;
, the expected number of deleted links followed by an insertion;
, the expected number of surviving links followed by an insertion.
We can calculate all the above update statistics simultaneously from data (the E-step) using a constrained version of the Inside-Outside algorithm [18] for the grammar in Figure 8, as follows. Assume the joint normalisation, P(d, a), and suppose that is the posterior expectation of the number of times rule m of Figure 8 was applied, as returned by the Inside-Outside algorithm. For emit rules, let be the expected number of times rule m was used to emit the specific nonterminals X, Y ... ∈ Ω. Then
The terms in parentheses are to be omitted if the conditional normalisation, P(d|a), is used.
The relationship between the expected insert and match usage , and the expected start, wait and transition usage of the previous work [22] is as follows
where is defined as in the previous work [22]
The pairwise aligner for the TKF91 Structure Tree is distributed as part of the DART package at the following URL:
The aligner is based on the Stochastic Context-Free Grammars (SCFGs) shown in Figures 7 and 8, as explained in the Methods section. The specific implementation uses a general Pair SCFG dynamic programming (DP) engine with accelerating heuristics, to be described in a later paper (manuscript in preparation).
For a given family, denote the two sequences in the family by A, B. The following computations were performed:
(1A), (1B) For each of the two sequences (A, B) taken individually, the secondary structure was predicted without the aid of comparative information from the other sequence, using the single-sequence SCFG of Figure 7.
(2) The two sequences (A, B) were then aligned using the TKF91 model, without making use of any model of RNA structure, using the Pair HMM of Figure 4.
(3) Finally, the two sequences (A, B) were aligned using the TKF91 Structure Tree model introduced in this paper, using the Pair SCFG of Figure 8.
This triangular dotplot illustrates the single-sequence structure prediction for sequence A of computation (1A). The pixel color at co-ordinates (x, y) represents the posterior probability that residues x and y of A are base-paired, in the absence of any information from sequence B
This triangular dotplot illustrates the single-sequence structure prediction for sequence B of computation (1B). The pixel color at co-ordinates (x, y) represents the posterior probability that residues x and y of B are base-paired, in the absence of any information from sequence A.
This rectangular dotplot illustrates the structure-free pairwise alignment of computation (2). The pixel color at co-ordinates (x, y) represents the posterior probability that residue x of A is homologous to residue y of B, in the absence of any structural information from the two sequences.
This triangular dotplot illustrates the comparative structure prediction for sequence A of computation (3). The pixel color at co-ordinates (x, y) represents the marginal posterior probability that residues x and y of A are base-paired, summed over all alignments to sequence B.
This triangular dotplot illustrates the comparative structure prediction for sequence B of computation (3). The pixel color at co-ordinates (x, y) represents the marginal posterior probability that residues x and y of B are base-paired, summed over all alignments to sequence A.
This rectangular dotplot illustrates the structural pairwise alignment of computation (3). The pixel color at co-ordinates (x, y) represents the marginal posterior probability that residue x of B is homologous to residue y of A, summed over all secondary structures of sequences A and B. Note that the orientation of this plot is flipped (reflected about the diagonal axis) relative to (c).
In addition, the "true" (published) structures and alignments are overlaid on the computational results as blue squares (or blue dots, on the larger images).
The rate parameters used for the TKF91 Structure Tree were obtained by maximum likelihood training from a random selection of structurally-annotated RFAM alignments, as follows:
λ _{1} = 0.027, μ _{1} = 0.03; λ _{2} = 0.007, μ _{2} = 0.01; p _{1}(S) = 0.01. The substitution rate parameters were taken from the PFOLD program [12]. The evolutionary "time" between the two sequences was set to 1 in each case. In the case of the RNase P and U2 genes, the DP algorithms were constrained to a band along the main diagonal of the DP matrix; this constraint was imposed due to limited memory. No such constraint was imposed for the purine riboswitch computations.
The purine riboswitches are a class of cis-acting regulatory elements that specifically bind adenine or guanine and are involved in the post-translational regulation of purine transport and biosynthesis [23]. Figure 9 shows the alignment of the two riboswitch sequences, from Bacillus halodurans and Streptococcus pneumoniae, which was taken from the RFAM database [14]. The two secondary structures of this pair are exactly identical, although the primary sequences are considerably diverged.
Figure 14 shows the posterior dotplots for the purine riboswitches. This is an easy case for the model, with a strong signal and few gaps. The TKF91 Structure Tree grammar (Figure 8) is able to identify all stems correctly, with some slight uncertainty over the alignment. The primary-sequence TKF91 grammar (Figure 4) is similarly able to find the correct alignment, although the singlet folding grammar (Figure 7) has difficulty resolving the stems (note that this grammar does not model basepair stacking effects).
The translational control element (TCE) is a regulatory sequence from the 3' untranslated region of the Drosophila nanos gene, involved in post-translational degradation and transport of nanos mRNA, which localises to the posterior of oocytes and other cell lines [24]. Figure 10 shows the alignment of the two TCE sequences, from Drosophila virilis and Drosophila melanogaster, which was curated by hand from the description by Gavis et al [24]. The two secondary structures of this pair share the same overall bifurcating-stem structure, but with some changes in stem length.
Figure 15 shows the posterior dotplots for the nanos TCEs. This time the TKF91 Structure Tree grammar (Figure 8) does considerably better than the primary-sequence TKF91 grammar (Figure 4) at finding the correct alignment, probably due to the gaps at the end (the TKF91 grammar in Figure 4 is effectively a global aligner with linear gaps, so that the alignments it produces tend to form a continuous line from corner to corner of the DP matrix, without major discontinuities, as can be seen in region (c) of Figure 15). Again, the Structure Tree does much better than the singlet folding grammar (Figure 7) at distinguishing real stems from background noise, since it is able to use covariation of basepaired residues as a clue.
The U2 small nuclear RNA recognizes and binds the branch point region of introns in pre-mRNA [25]. Figure 11 shows the alignment of the two splicing factors, from Tetrahymena thermophila and Leptomonas collosoma, was taken from the RFAM database [14]. The secondary structures of the two sequences are quite similar, but the Leptomonas U2 has a deletion of roughly 35 bp that eliminates an entire stem (stems 4–6 on Figure 11).
Figure 16 shows the posterior dotplots for the U2 snRNAs. As before, the Structure Tree's stem predictions (regions (d) and (e), above the main diagonal of Figure 16) are far more specific than the singlet grammar's predictions (regions (a) and (b), below the main diagonal). The primary-sequence TKF91 grammar (Figure 4) is, again, hampered by its global alignment and linear gap penalty, and the alignment in region (c) is stretched and also uncertain. However, the Pair SCFG (Figure 8) manages to identify the 35-bp deletion and correctly finds stem 4 of Figure 11, though stems 5–6 have a lower probability (when predicting the structure of this deleted region, the Pair SCFG is unable to use covariation and must rely on basepairing information alone).
Nuclear RNase P is a class of endoribonuclease ribozyme involved in the production of mature 5' ends of transfer RNAs during tRNA biosynthesis [26]. Figure 12 shows the alignment of the two ribozyme sequences, from Pichia canadensis and Clavispora opuntiae, which was taken from the RFAM database [14]. The secondary structures of the two sequences are quite different, with major change in stem length and deletion of whole stem structures, characteristic of this gene family (stems 0–2 and 8–9 of Figure 12). Figure 17 shows the posterior dotplots for the RNase P genes. This family is one of the most mutable in RFAM, and the TKF91 Structure Tree performs poorly on this case. Both the Pair HMM (Figure 4; region (c) of Figure 17) and the Pair SCFG (Figure 8; region (f) of Figure 17) get the alignment almost entirely wrong, except for a region toward the 3' end that doesn't contain any stems (the region just before stem 8 of Figure 12). As a consequence, the Pair SCFG also fails to predict any stems correctly; the singlet SCFG (Figure 7) does no better. Region (f) of Figure 17 displays the continuous-line alignment from corner-to-corner, that is characteristic of global aligners with linear gaps: unlike the case of the U2 alignment, the structural signal here is insufficient to compensate for the indel-modeling deficiencies of the TKF91 Structure Tree.
The log-odds score of the "true" alignment (Figure 12) under the Structure Tree model is highly negative (-547 bits), suggesting that the model is poorly adapted for this example. Compare this with the scores for the previous examples: Figure 9 scored 2 bits, Figure 10 scored -82 bits and Figure 11 scored 35 bits. The low score for the nanos TCEs (Figure 10) was due primarily to the deletions in the outermost stem; the score rose to -5 bits with judicious trimming and careful choice of the "time" parameter.
We have described a reversible continuous-time Markov chain, called the "TKF91 Structure Tree", that describes both (i) covariant substitutions and indels in RNA sequence contingent upon a particular secondary structure, and (ii) changes in the underlying RNA secondary structure, corresponding to gain and loss of substructures. A pairwise alignment algorithm based on the model has been implemented in C++ and tested on four homologous pairs of RNA functional element from RFAM [14]. As with the TKF91 model on which the TKF91 Structure Tree is based [10], it should be possible, systematically, to design corresponding algorithms for multiple sequences, using either exhaustive dynamic programming [6, 27] or Markov Chain Monte Carlo [13].
It should be noted that the present implementation of the TKF91 Structure Tree is not designed to be a direct competitor to programs like FOLDALIGN [20], DYNALIGN [21] or CARNAC [28]. Such pairwise alignment programs are optimized for criteria like alignment accuracy and sensitivity. The TKF91 Structure Tree, on the other hand, was designed as an expository evolutionary model, ultimately aimed at phylogenetic analysis of multiple RNA sequences in an evolutionary likelihood context. The pairwise alignment program reported in this paper was implemented to demonstrate the potential of this evolutionary model, rather than for use as a practical alignment tool. The author's STEMLOC program, which is similarly based on Pair SCFGs, has been optimized for practical applications (preferring short-term performance advantages over long-term design considerations) and may be freely downloaded from http://www.biowiki.org/.
The results of our tests on pairwise alignments from RFAM reveal the strengths and weaknesses of our model. When RNA structure is very strongly conserved and indels are few, as with the purine riboswitches selected for this comparison (Figure 14), the TKF91 Structure Tree performs well at both structure prediction and alignment. On such alignments, the model is expected to be similar to PFOLD [12], which uses an SCFG and an evolutionary substitution model but lacks an evolutionary treatment of gaps. When the alignment has numerous indels in loops and stems, as in the selected nanos TCEs (15), or even minor rearrangements of structure, as in the selected U2 splice factors (16), the Structure Tree still seems to work well. However, beyond a certain level of structural change, as in the selected RNase P alignment (17), the model performs poorly and leaves considerable room for improvement.
In view of the room for improvement, we can identify a number of weaknesses of the TKF91 Structure Tree that could be improved in future models:
Sources of degeneracy such as zero-length stems and loops were removed "by hand" from the Pair SCFG (Table 1). These degeneracies could have beeen specifically excluded from the evolutionary model, but with the apparent cost of making an exact solution much harder to find. One might expect the nondegenerate grammars of Figures 7 and 8 to approximate the transition probabilities of such a nondegenerate model.
Indel rates for whole stems/multistems are same as for unpaired residues. In nature, stem gain and loss is much slower than unpaired residue insertion/deletion, since the former is a structural change while the latter is not.
Multiple-residue indel events, and hence affine gap penalties, are not allowed. Again, the poor performance on the RNase P alignment may in part be due to this: the alignment generated has many small gaps scattered throughout, whereas the "true" alignment has fewer, longer gaps. This is a characteristic artefact of using a point indel model (linear gap penalty) where a multi-residue indel model (affine penalty) would be more appropriate.
Stems cannot be deleted without deleting all their "children" as well (i.e. all stems nested inside). Empirical inspection of alignments in RFAM, however, reveals many structures where an outer stem has been deleted or truncated, while inner stems are preserved. Again, perhaps an affine gap penalty for covariant indels (i.e. indels in stems) would address this. Alternatively, one might contrive some kind of "ragged-end" local alignment model, e.g. by embedding the finite TKF91 Structure Tree in an infinite, unobserved Structure Tree (c.f. [8]), though this may not be the ideal way to model such effects.
The equilibrium distribution over structures is highly simplified. For example, there is currently no modeling of fine-scale energetics such as basepair stacking propensities due to π-orbital conjugation. Mathematically, the complexities of modeling such effects are somewhat similar to those involved in modeling nearest-neighbor substitution biases in DNA (such as methylation-induced CpG deamination). Since recent progress has been made with such models [29, 30] we might eventually expect inclusion of stacking effects in models of covariant RNA substitution, as well.
Bulges cannot be inserted into stems, except via the following awkward mechanism: the insertion of a bulge into a stem requires the pre-existence of a null S → L → S transition, where the L is empty. To fix this, L nodes could be allowed in stem sequences, just as S nodes are allowed in loop sequences (in fact, one should probably introduce "left" and "right" L-nodes, corresponding to left & right bulges). However, this would increase the potential for degeneracies.
We have assumed that all stems and loops evolve at the same rate, whereas empirical inspection of RFAM of suggests otherwise. It is known that the analogous assumption in proteins (that all sites evolve at the same rate) can skew phylogenetic distance estimation [31], and perhaps a similar correction to the discretized gamma priors used in proteins could be applied here [32].
There is no special treatment of structural features such as triloops, tetraloops, triple-A platforms, U-turns and the like. Such features are often observed to be evolutionary conserved [33, 34] and seem likely to be involved in intermolecular interactions [35, 36]. It would be relatively easy to incorporate such features into the TKF91 Structure Tree, as special classes of L- or S-branch.
While the lengths of stem sequences are geometrically distributed in the TKF91 Structure Tree, due to their roots in the TKF91 model, empirical observations of real RNA structures suggest that real stem lengths follow a fairly tight length distribution. Such approximations in modeling stem lengths could conceivably contribute to poorer performance of the model. (In practise, we have not observed unnaturally long stems in the output of the TKF91 Structure Tree aligner, but the existence of a long, perfect inverted repeat in the sequence could conceivably bring out this problem.)
Despite these drawbacks, the results of our preliminary benchmark suggest that the TKF91 Structure Tree may be useful for aligning (at least the better-conserved) RNA functional elements. Given the recent growth of RNA sequence and structure databases such as RFAM [14] and SCOR [34], it would be interesting to carry out a broad-scale, empirical study of the mutations of RNA structures. This could then be used as a starting point for systematically designing and benchmarking an improved evolutionary model for RNA. In the meantime, the results presented here suggest new ways of designing evolutionary grammars that recognise higher-level structural change as well as point substitutions and indels, offering new ways of using high-throughput comparative sequencing to interpret the contents of genomes.
This manuscript has evolved over the course of discussions with Sam Griffiths-Jones, Alex Bateman, Sean Eddy, Elena Rivas, Eric Westhof, Bjarne Knudsen, Kushal Chakrabarti, Jotun Hein, Gerton Lunter and David Haussler. The author thanks two anonymous reviewers for helpful criticism. The work was partly developed during a workshop in Benasque, Spain organised by Elena Rivas and Eric Westhof in 2003. The work was partially supported by grants from the UK's EPSRC (code HAMJW) and MRC (code HAMKA).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.