A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure

Background Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N3) in memory. This is only practical for small RNAs. Results I describe a divide and conquer variant of the alignment algorithm that is analogous to memory-efficient Myers/Miller dynamic programming algorithms for linear sequence alignment. The new algorithm has an O(N2 log N) memory complexity, at the expense of a small constant factor in time. Conclusions Optimal ribosomal RNA structural alignments that previously required up to 150 GB of memory now require less than 270 MB.


Background
There are a growing number of RNA gene families and RNA motifs [1,2]. Many (though not all) RNAs conserve a base-paired RNA secondary structure. Computational analyses of RNA sequence families are more powerful if they take into account both primary sequence and secondary structure consensus [3,4]. Some excellent approaches have been developed for database searching with RNA secondary structure consensus patterns. Exact-and approximate-match pattern searches (analogous to PROSITE patterns for proteins) have been extended to allow patterns to specify long-range base pairing constraints [5,6]. In several cases, specialized programs have been developed to recognize specific RNA structures [4] -for example, programs exist for detecting transfer RNA genes [7][8][9], group I catalytic introns [10], and small nucleolar RNAs [11,12]. All of these approaches, though powerful, lack generality, and they require ex-pert knowledge about each particular RNA family of interest.
In primary sequence analysis, the most useful analysis techniques are general primary sequence alignment algorithms with probabilistically based scoring systems -for example, the BLAST [13], FASTA [14], or CLUSTALW [15] algorithms, and the PAM [16] or BLOSUM [17] score matrices. Unlike specialized programs, a general alignment algorithm can be applied to find homologs of any query sequence(s). Unlike pattern searches, which give yes/no answers for whether a candidate sequence is a match, a scoring system gives a meaningful score that allows ranking candidate hits by their statistical significance. It is of interest to develop general alignment algorithms for RNA secondary structures.
The problem I consider here is as follows. I am given a multiple alignment of an RNA sequence family for which I know the consensus secondary structure. I want to search a sequence database for homologs that significantly match the sequence and structure of my query. The sequence analysis analogue is the use of profile hidden Markov models (profile HMMs) to model multiple alignments of conserved protein domains, and to discover new homologues in sequence databases [18,19]. For instance, if we had an RNA structure equivalent of the HMMER profile HMM program suite [http://hmmer.wustl.edu/] it would be possible to develop and efficiently maintain databases of conserved RNA structures and multiple alignments, analogous to the Pfam or SMART databases of conserved protein domains [20,21].
Stochastic context free grammar (SCFG) algorithms provide a general approach to RNA structure alignment [22][23][24]. SCFGs allow the strong pairwise residue correlations in non-pseudoknotted RNA secondary structure to be taken into account in RNA alignments. SCFGs can be aligned to sequences using a dynamic programming algorithm that guarantees finding a mathematically optimal solution in polynomial time. SCFG alignment algorithms can be thought of as an extension of sequence alignment algorithms (particularly those with fully probabilistic, hidden Markov model formulations) into an additional dimension necessary to deal with 2D RNA secondary structure.
While SCFGs provide a natural mathematical framework for RNA secondary structure alignment problems, SCFG algorithms have high computational complexity that has impeded their practical application. Optimal SCFG-based structural alignment of an RNA structure to a sequence costs O(N 3 ) memory and O(N 4 ) time for a sequence of length N, compared to O(N 2 ) memory and time for sequence alignment algorithms. (Corpet and Michot described a program that implements a different general dynamic programming algorithm for RNA alignment; their algorithm solves the same problem but even less efficiently, requiring O(N 4 ) memory and O(N 5 ) time [25].) SCFG-based alignments of small structural RNAs are feasible. Using my COVE software [http://www.genetics.wustl.edu/eddy/software#cove], transfer RNA alignments (~75 nucleotides) take about 0.2 cpu second and 3 Mb of memory. Most genome centers now use an COVE-based search program, tRNAscan-SE, for annotating transfer RNA genes [9]. However, many larger RNAs of interest are OUTSIDE the capabilities of the standard SCFG alignment algorithm. Alignment of a small subunit (SSU) ribosomal RNA sequence to the SSU rRNA consensus structure would take about 23 GB of RAM and an hour of CPU time. Applying SCFG methods to RNAs this large has required clever heuristics, such as using a precalculation of confidently predicted regions of primary sequence alignment to strongly constrain which parts of the SCFG dynamic programming matrix need to be calculated [26].
The steep memory requirement remains a significant barrier to the practicality of SCFG algorithms. Notredame et al. pointed specifically to this problem [27]. They described RAGA, a program that uses a genetic algorithm (GA) to optimize a pairwise RNA alignment using an objective function that includes base pairing terms. Because GAs have an O(N) memory requirement, RAGA can find reasonable solutions for large RNA alignment problems, including ribosomal RNA alignments. A different memory-efficient approach has also been described [28,29]. However, both approaches are approximate and cannot guarantee a mathematically optimal solution, in contrast to the (mathematically) optimal but more expensive dynamic programming approaches.
Here, I introduce a dynamic programming solution to the problem of structural alignment of large RNAs. The central idea is a divide and conquer strategy. For linear sequence alignment, a divide and conquer algorithm was introduced by Hirschberg [30], an algorithm known in the computational biology community as the Myers/Miller algorithm [31]. (Ironically, at the time, dynamic programming methods for optimal sequence alignment were well known, but were considered impractical on 1970's era computers because of the "extreme" O(N 2 ) memory requirement.) Myers/Miller reduces the memory complexity of a dynamic programming sequence alignment algorithm from O(N 2 ) to O(N), at the cost of a roughly two-fold increase in CPU time. Here I show that a divide and conquer strategy can also be applied to the RNA structural alignment problem, greatly reducing the memory requirement of SCFG alignments and making optimal structural alignment of large RNAs possible. I will strictly be dealing with the problem of aligning a target sequence of unknown secondary structure to a query of known RNA structure. By "secondary structure" I mean nested (nonpseudoknotted) pairwise RNA secondary structure interactions, primarily Watson-Crick base pairs but also permitting noncanonical base pairs. This RNA structural alignment problem is different from the problem of aligning two known RNA secondary structures together [32], and from the problem of aligning two RNA sequences of unknown structure together under a secondary structure-aware scoring system [33][34][35][36][37].

Prelude: the simpler case of sequence alignment
The essential concepts of a divide and conquer alignment algorithm are most easily understood for the case of linear sequence alignment [30,31].
Dynamic programming (DP) algorithms for sequence alignment fill in an N × M DP matrix of scores F(i,j) for two sequences of lengths N and M (N <M) [38,39]. Each score F(i,j) is the score of the optimal alignment of prefix x 1 ..x i of one sequence to prefix y 1 ..y j of the other. These scores are calculated iteratively, e.g. for global (Needleman/Wunsch) alignment: At the end, F(N, M) contains the score of the optimal alignment. The alignment itself is recovered by tracing the individual optimal steps backwards through the matrix, starting from cell (N,M). The algorithm is O(NM) in both time and memory.
If we are only interested in the score, not the alignment itself, the whole F matrix does not have to be kept in memory. The iterative calculation only depends on the current and previous row of the matrix. Keeping two rows in memory suffices (in fact, a compulsively efficient implementation can get away with N + 1 cells). A score-only calculation can be done in O(N) space.
The fill stage of DP alignment algorithms may be run either forwards and backwards. We can just as easily calculate the optimal score B(i, j) of the best alignment of the suffix i + 1..N of sequence 1 to the suffix j + 1..M of sequence 2, until one obtains B(0,0), the overall optimal score -the same number as F(N,M).
The sum of F(i,j) and B(i,j) at any cell in the optimal path through the DP matrix is also the optimal overall alignment score. More generally, F(i,j) + B(i,j) at any cell (i,j) is the score of the best alignment that uses that cell. Therefore, since we know the optimal alignment must pass through any given row i somewhere, we can pick some row i in the middle of sequence x, run the forward calculation to i to obtain row F(i), run the backwards calculation back to i to get row B(i), and then find argmax j F(i,j)+B(i,j). Now I know the optimal alignment passes through cell (i,j). (For clarity, I am leaving out details of how indels and local alignments are handled.) This divides the alignment into two smaller alignment problems, and these smaller problems can themselves be subdivided by the same trick. Thus, the complete optimal alignment can be found by a recursive series of split point calculations. Although this seems laborious -each calculation is giving us only a single point in the alignment -if we choose our split row i to be in the middle, the size of the two smaller DP problems is decreased by about 4-fold at each split. A complete alignment thus costs only about times as much CPU time as doing the alignment in a single DP matrix calculation, but the algorithm is O(N) in memory.
A standard dynamic programming alignment algorithm for SCFGs is the Cocke-Younger-Kasami (CYK) algorithm, which finds an optimal parse tree (e.g. alignment) for a model and a sequence [24,[40][41][42]. (CYK is usually described in the literature as a dynamic programming recognition algorithm for nonstochastic CFGs in Chomsky normal form, rather than as a dynamic programming parsing algorithm for SCFGs in any form. The use of the name "CYK" here is therefore a little imprecise [24].) CYK can be run in a memory-saving "score only" mode. The DP matrix for CYK can also be filled in two opposite directions -either "inside" or "outside", analogous to forward and backward DP matrix fills for linear sequence alignment. I will refer to these algorithms as CYK/inside and CYK/outside (or just inside and outside), but readers familiar with SCFG algorithms should not confuse them with the SCFG Inside and Outside algorithms [43,44] which sum over all possible parse trees rather than finding one optimal parse tree. I am always talking about the CYK algorithm in this paper, and by "inside" and "outside" I am only referring generically to the direction of the CYK DP calculation.
The CYK/inside and CYK/outside algorithms are not as nicely symmetrical as the forward and backward DP fills are in sequence alignment algorithms. The splitting procedure that one obtains does not generate identical types of subproblems, so the divide and conquer procedure for SCFG-based RNA alignment is not as obvious.

Definition and construction of a covariance model
The divide and conquer algorithm I will describe is specific for RNA "covariance models" (CMs). A covariance model is a profile stochastic context free grammar designed to model a consensus RNA secondary structure with position-specific scores [22,24]. My algorithm takes advantage of features of CMs that do not apply to SCFGs in general. Therefore I start with an introduction to what CMs are, how they correspond to a known RNA secondary structure, and how they are built and parameterized.
Definition of a stochastic context free grammar A stochastic context free grammar (SCFG) consists of the following: • M different nonterminals (here called states). I will use capital letters to refer to specific nonterminals; V and Y will be used to refer generically to unspecified nonterminals. • K different terminal symbols (e.g. the observable alphabet, a,c,g,u for RNA). I will use small letters a, b to refer generically to terminal symbols.
• a number of production rules of the form: V → γ, where γ can be any string of nonterminal and/or terminal symbols, including (as a special case) the empty string ε.
• Each production rule is associated with a probability, such that the sum of the production probabilities for any given nonterminal V is equal to 1.

SCFG productions allowed in covariance models
A covariance model is a specific repetitive "profile SCFG" architecture consisting of groups of model states that are associated with base pairs and single-stranded positions in an RNA secondary structure consensus. A covariance model has seven types of states and production rules (Table 1).
Each overall production probability is the independent product of an emission probability e v and a transition probability t v , both of which are position-dependent parameters that depend on the state v (analogous to hidden Markov models). For example, a particular pair (P) state v produces two correlated letters a and b (e.g. one of 16 possible base pairs) with probability e v (a, b) and transits to one of several possible new states Y of various types with probability t v (Y). A bifurcation (B) state splits into two new start (S) states with probability 1. The E state is a special case ε production that terminates a derivation.
A CM consists of many states of these seven basic types, each with its own emission and transition probability distributions, and its own set of states that it can transition to. Consensus base pairs will be modeled by P states, consensus single stranded residues by L and R states, insertions relative to the consensus by more L and R states, deletions relative to consensus by D states, and the branching topology of the RNA secondary structure by B, S, and E states. The procedure for starting from an input multiple alignment and determining how many states, what types of states, and how they are interconnected by transition probabilities is described next.
From consensus structural alignment to guide tree Figure 1 shows an example input file: a multiple sequence alignment of homologous RNAs, with a line describing the consensus RNA secondary structure. The first step of building a CM is to produce a binary guide tree of nodes representing the consensus secondary structure. The guide tree is a parse tree for the consensus structure, with nodes as nonterminals and alignment columns as terminals.
The guide tree has eight types of nodes (Table 2).
These consensus node types correspond closely with a CM's final state types. Each node will eventually contain one or more states. The guide tree deals with the consensus structure. For individual sequences, we will need to deal with insertions and deletions with respect to this consensus. The guide tree is the skeleton on which we will organize the CM. For example, a MATP node will contain a P-type state to model a consensus base pair; but it will also contain several other states to model infrequent insertions and deletions at or adjacent to this pair.
The input alignment is first used to construct a consensus secondary structure ( Figure 2) that defines which aligned columns will be ignored as non-consensus (and later modeled as insertions relative to the consensus), and which consensus alignment columns are base-paired to

Figure 1
An example RNA sequence family. Top: a toy multiple alignment of three sequences, with 28 total columns, 24 of which will be modeled as consensus positions. The [structure] line annotates the consensus secondary structure: > and < symbols mark base pairs, x's mark consensus single stranded positions, and .'s mark "insert" columns that will not be considered part of the consensus model. Bottom: the secondary structure of the "human" sequence. each other. Here I assume that both the structural annotation and the labeling of insert versus consensus columns is given in the input file, as shown in the line marked "[structure]" in the alignment in Figure 1. Alternatively, automatic methods might be employed. A consensus structure could be predicted from comparative analysis of the alignment [22,45,46]. The consensus columns could be chosen as those columns with less than a certain fraction of gap symbols, or by a maximum likelihood criterion, as is done for profile HMM construction [18,24].
Given the consensus structure, consensus base pairs are assigned to MATP nodes and consensus unpaired columns are assigned to MATL or MATR nodes. One ROOT node is used at the head of the tree. Multifurcation loops and/or multiple stems are dealt with by assigning one or more BIF nodes that branch to subtrees starting with BEGL or BEGR head nodes. (ROOT, BEGL, and BEGR start nodes are labeled differently because they will be expanded to different groups of states; this has to do with avoiding ambiguous parse trees for individual sequences, as described below.) Alignment columns that are considered to be insertions relative to the consensus structure are ignored at this stage.
In general there will be more than one possible guide tree for any given consensus structure. Almost all of this ambiguity is eliminated by three conventions: (1) MATL nodes are always used instead of MATR nodes where possible, for instance in hairpin loops; (2) in describing interior loops, MATL nodes are used before MATR nodes; and (3) BIF nodes are only invoked where necessary to explain branching secondary structure stems (as opposed to unnecessarily bifurcating in single stranded sequence). One source of ambiguity remains. In invoking a bifurcation to explain alignment columns i..j by two substructures on columns i..k and k + 1..j, there will be more than one possible choice of k if i..j is a multifurcation loop containing three or more stems. The choice of k impacts the performance of the divide and conquer algorithm; for optimal time performance, we will want bifurcations to split into roughly equal sized alignment problems, so I choose the k that makes i..k and k + 1..j as close to the same length as possible.
The result of this procedure is the guide tree. The nodes of the guide tree are numbered in preorder traversal (e.g. a recursion of "number the current node, visit its left child, visit its right child": thus parent nodes always have lower indices than their children). The guide tree corresponding to the input multiple alignment in Figure 1 is shown in Figure 2.
From guide tree to covariance model A CM must deal with insertions and deletions in individual sequences relative to the consensus structure. For example, for a consensus base pair, either partner may be deleted leaving a single unpaired residue, or the pair may be entirely deleted; additionally, there may be inserted nonconsensus residues between this pair and the next pair in the stem. Accordingly, each node in the master tree is expanded into one or more states in the CM as follows ( Table 3) Here we distinguish between consensus ("M", for "match") states and insert ("I") states. ML and IL, for example, are both L type states with L type productions, but they will have slightly different properties, as described below.

Figure 2
The structural alignment is converted to a guide tree. Left: the consensus secondary structure is derived from the annotated alignment in Figure 1. The states are grouped into a split set of 1-4 states (shown in brackets above) and an insert set of 0-2 insert states. The split set includes the main consensus state, which by convention is first. One and only one of the states in the split set must be visited in every parse tree (and this fact will be exploited by the divide and conquer algorithm). The insert state(s) are not obligately visited, and they have selftransitions, so they will be visited zero or more times in any given parse tree.
State transitions are then assigned as follows. For bifurcation nodes, the B state makes obligate transitions to the S states of the child BEGL and BEGR nodes. For other nodes, each state in a split set has a possible transition to every insert state in the same node, and to every state in the split set of the next node. An IL state makes a transition to itself, to the IR state in the same node (if present), and to every state in the split set of the next node. An IR state makes a transition to itself and to every state in the split set of the next node.
This arrangement of transitions guarantees that (given the guide tree) there is unambiguously one and only one parse tree for any given individual structure. This is important. The algorithm will find a maximum likelihood parse tree for a given sequence, and we wish to interpret this result as a maximum likelihood structure, so there must be a one to one relationship between parse trees and secondary structures [47].
The final CM is an array of M states, connected as a directed graph by transitions t v (y) (or probability 1 transitions v → (y,z) for bifurcations) with the states numbered such that (y,z) ≥ v. There are no cycles in the directed graph other than cycles of length one (e.g. the self-transitions of the insert states). We can think of the CM as an array of states in which all transition dependencies run in one direction; we can do an iterative dynamic programming calculation through the model states starting with the last numbered end state M and ending in the root state 1. An example CM, corresponding to the input alignment of Figure 1, is shown in Figure 3.
As a convenient side effect of the construction procedure, it is guaranteed that the transitions from any state are to a contiguous set of child states, so the transitions for state v may be kept as an offset and a count. For example, in Figure 3, state 12 (an MP) connects to states 16,17,18,19,20, and 21. We can store this as an offset of 4 to the first connected state, and a total count of 6 connected states. We know that the offset is the distance to the next nonsplit state in the current node; we also know that the count is equal to the number of insert states in the current node, plus the number of split set states in the next node. These properties make establishing the connectivity of the CM trivial. Similarly, all the parents of any given state are also contiguously numbered, and can be determined analogously. We are also guaranteed that the states in a split set are numbered contiguously. This contiguity is exploited by the divide and conquer implementation.

Parameterization
Using the guide tree and the final CM, each individual sequence in the input multiple alignment can be converted unambiguously to a CM parse tree, as shown in Figure 4. Counts for observed state transitions and singlet/pair emissions are then collected from these parse trees. The observed counts are converted to transition and emission probabilities by standard procedures. I calculate maximum a posteriori parameters, using Dirichlet priors.

Comparison to profile HMMs
The relationship between an SCFG and a covariance model is analogous to the relationship of hidden Markov models (HMMs) and profile HMMs for modeling multiple sequence alignments [18,19,24]. A comparison may be instructive to readers familiar with profile HMMs. A profile HMM is a repetitive HMM architecture that associates each consensus column of a multiple alignment with a single type of model node -a MATL node, in the above   notation. Each node contains a "match", "delete", and "insert" HMM state -ML, IL, and D states, in the above notation. The profile HMM also has special begin and end states. Profile HMMs could therefore be thought of as a special case of CMs. An unstructured RNA multiple alignment would be modeled by a guide tree of all MATL nodes, and converted to an unbifurcated CM that would essentially be identical to a profile HMM. (The only difference is trivial; the CM root node includes a IR state, whereas the start node of a profile HMM does not.) All the other node types (especially MATP, MATR, and BIF) and state types (e.g. MP, MR, IR, and B) are SCFG augmentations necessary to extend profile HMMs to deal with RNA secondary structure.
The SCFG Inside and Outside algorithms are analogous to the Forward and Backward algorithms for HMMs [24,48]. The CYK/inside parsing algorithm is analogous to the Viterbi HMM alignment algorithm run in the forward direction. CYK/outside is analogous to a Viterbi DP algorithm run in the backwards direction. S v refers to the type of state v; it will be one of seven types {D,P,L,R,S,E,B}. C v is a list of children for state v (e.g. the states that v can transit to); it will contain up to six contiguous indices y with v ≤ y ≤ M. P v is a list of parents for state v (states that could have transited to state v); it will contain up to six contiguous indices y with 1 ≤ y ≤ v. (P v parent lists should not be confused with P state types.)

Divide and conquer algorithm
I use g, h, i, j, k, p, and q as indices referring to positions in a sequence x, where g ≤ h ≤ p ≤ q and i ≤ j for all subsequences of nonzero length. These indices range from 1..L, for a sequence of length L. Some algorithms will also use d to refer to a subsequence length, where d = j -i + 1 for a subsequence x i ..x j .
The algorithms will have to account for subsequences of zero length (because of deletions). By convention, these will be in the off-diagonal where j = i -1 or i = j + 1. This special case (usually an initialization condition) is the reason for the qualification that i ≤ j for subsequences of nonzero length.

Figure 4
Example parse trees. Parse trees are shown for the three sequences/structures from Figure 1, given the CM in Figure 3. For each sequence, each residue must be associated with a state in the parse tree. (The sequences can be read off its parse tree by starting at the upper left and reading counterclockwise around the edge of parse tree.) Each parse tree corresponds directly to a secondary structure -base pairs are pairs of residues aligned to MP states. A collection of parse trees also corresponds to a multiple alignment, by aligning residues that are associated with the same state -for example, all three trees have a residue aligned to state ML4, so these three residues would be aligned together. Insertions and deletions relative to the consensus use nonconsensus states, shown in gray.
The CYK/inside algorithm calculates a three-dimensional matrix of numbers α v (i,j), and CYK/outside calculates numbers β v (i,j). I will refer to v (state indices) as deck coordinates in the three-dimensional matrices, whereas j and i (sequence positions) are row and column coordinates within each deck. α v and β v refer to whole two-dimensional decks containing scores α v (i,j) and β v (i,j) for a particular state v. The dividing and conquering will be done in the v dimension, by choosing particular decks as split points.

The CYK/inside algorithm
The CYK/inside algorithm iteratively calculates α v (i,j)the log probability of the most likely CM parse subtree rooted at state v that generates subsequence x i ..x j of sequence x. The calculation initializes at the smallest subgraphs and subsequences (e.g. subgraphs rooted at E states, generating subsequences of length 0), and iterates outwards to progessively longer subsequences and larger CM subgraphs.
For example, if we're calculating α v (i,j) and S v = P (that is, v is a pair state), v will generate the pair x i , x j and transit to a new state y (one of its possible transitions C v ) which then will have to account for the smaller subsequence x i+1 ..x j-1 . The log probability for a particular choice of next state y is the sum of three terms: an emission term log e v (x i ,x j ), a transition term log t v (y), and an already calculated solution for the smaller optimal parse tree rooted at y, α y (i + 1, j -1). The answer for α v (i,j) is the maximum over all possible choices of child states y that v can transit to.
The algorithm INSIDE is as follows: Input: A CM subgraph and subsequence x g ..x q . Output: Scoring matrix decks α r ..α z . INSIDE(r,z; g,q) for v ← z down to r for j ← g -1 to q We do not have to keep the entire α three-dimensional matrix in memory to calculate these scores. As we reach higher decks α v in the three dimensional dynamic programming matrix, our calculations no longer depend on certain lower decks. A lower deck y can be deallocated whenever all the parent decks P y that depend on it have been calculated. (The implementation goes even further and recycles decks when possible, saving some initialization steps and many memory allocation calls; for example, since values in all E decks are identical, only one E deck needs to be calculated and that precalculated deck can be reused whenever S v = E.) This deallocation rule has an important property that the divide and conquer algorithm takes advantage of when solving smaller subproblems for CM subgraphs rooted at some state w. When the root state w is an S state, the α matrix returned by INSIDE contains only one active deck α w .
(No lower state > w can be reached from any state <w without going through w, so all lower decks are deallocated once deck w is completed.) When the root state w is the first state in a split set w..y (see below for more explanation), all (and only) the decks α w ..α y are active when IN-SIDE returns.
In some cases we want to recover the optimal parse tree itself, not just its score. The INSIDE τ routine is a modified version of INSIDE. It keeps an additional "shadow matrix" τ v (i,j). A τ v (i,j) traceback pointer either records the index y that maximized α v (i,j) (for state types D,S,P,L,R) or records the split point k that maximized α v (i,j) for a bifurcation (B) state. The τ shadow matrix does not use the deallocation rules -INSIDE τ can only be called for problems small enough that they can be solved within our available memory space. Thus the INSIDE τ routine works by calling INSIDE in a mode that also keeps a shadow matrix τ, and then calls a recursive traceback starting with v, i, j: Input: A shadow matrix τ for CM subgraph G v rooted at state v, and subsequence x i ..x j .
Output: An optimal parse tree T.
The CYK/outside algorithm The CYK/outside algorithm iteratively calculates β v (i, j), the log probability of the most likely CM parse tree for a CM generating a sequence x 1 ..x L excluding the optimal parse subtree rooted at state v that accounts for the subsequence x i ..x j . The calculation initializes with the entire sequence excluded (e.g. β 1 (1, L) = 0), and iterates inward to progressively shorter and shorter excluded subsequences and smaller CM subgraphs.
A complete implementation of the CYK/outside algorithm requires first calculating the CYK/inside matrix α because it is needed to calculate β v when the parent of v is a bifurcation [24,43,44]. However, the divide and conquer algorithm described here only calls OUTSIDE on unbifurcated, linear CM subgraphs (only the final state z may be a B state; there are no internal bifurcations that lead to branches in the model). Thus the parent of a state v is never a bifurcation, and the implementation can therefore be streamlined as follows: Input: An unbifurcated CM subgraph and subsequence x g ..x q .
Output: Scoring matrix decks β r ..β z . OUTSIDE(r,z; g,q) for v ← r + 1 to z for j ← q down to g -1 for i ← g to j + 1 As with INSIDE, we do not keep the entire β matrix in memory. A deck β v can be deallocated when all child decks C v that depend on the values in β v have been calculated. This means that if the last deck z is a bifurcation or end state, β z will be the only active allocated deck when OUTSIDE returns. If z is the last state in a split set w..z, all (and only) the split set decks β w ..β z will be active when OUTSIDE returns.

Using CYK/inside and CYK/outside to divide and conquer
Now, for any chosen state v, tells us which cell v, i, j the optimal parse tree passes through, conditional on using state v in the parse. We know that any parse tree must include all the bifurcation and start states of the CM, so we know that the optimal alignment must use any chosen bifurcation state v and its child start states w and y. Thus, we are guaranteed that (when S v = B and C v = w, y): is the optimal overall alignment score, and we also know that gives us a triplet that identifies three cells that must be in the optimal alignment -(v, i, j), (w, i, k), and (y, k + 1, j). This splits the remaining problem into three smaller subproblems -an alignment of the sequence x i ..x k to a CM subgraph w..y -1, an alignment of the sequence x k+1 ..x j to a CM subgraph y..M, and an alignment of the two-piece sequence x 1 ..z i-1 //x j+1 ..x L to a CM subgraph 1..v.
The subproblems are then themselves split, and this splitting can continue recursively until all the bifurcation triplets on the optimal parse tree have been determined.
At this point the remaining alignment subproblems might be small enough to be solved by straightforward application of the standard CYK/inside algorithm (e.g. INSIDE τ ). However, this is not guaranteed to be the case. A more general division strategy is needed that does not depend on splitting at bifurcations.
For the more general strategy we take advantage of the fact that we know that the optimal parse tree must also include one and only one state from the split set of each node (e.g. the non-insert states in the node). Let w..y be the indices of a split set of states in the middle of the current model subgraph. (w..y can be at most 4 states.) We know that gives us a new cell (v, i, j) in the optimal parse tree, and splits the problem into two smaller problems. This strategy can be applied recursively all the way down to single nodes, if necessary. We can therefore guarantee that we will never need to carry out a full CYK/inside alignment algorithm on any subproblem. The most memory-intensive alignment problem that needs to be solved is the very first split. The properties of the first split determine the memory complexity of the algorithm.
The bifurcation-dependent strategy is a special case of this more general splitting strategy, where the B state is the only member of its split set, and where we also take advantage of the fact that α v (i,j) = max k α w (i, k) + α y (k + 1,j). By carrying out the max k operation during the split, rather than before, we can split the current problem into three optimal pieces instead of just two.
If we look at the consequences of these splitting strategies, we see we will have to deal with three types of problems ( Figure 5): • A generic problem means finding the optimal alignment of a CM subgraph to a contiguous subsequence x g ..x q . The subgraph corresponds to a complete subtree of the CM's guide tree -e.g. state r is a start (S), and state z is an end (E). may contain bifurcations. The problem is solved in one of two ways. If contains no bifurcations, it is solved as a wedge problem (see below). Else, the problem is subdivided by the bifurcation-dependent strategy: an optimal triple (i, k, j) is found for a bifurcation state v and its children w, y, splitting the problem into a V problem and two generic problems.
• A wedge problem means finding the optimal alignment of an unbifurcated CM subgraph to a contiguous subsequence x g ..x q . State z does not have to be a start state (S); it may be a state in a split set (MP, ML, MR, or D). State z is an end (E). A wedge problem is solved by the split setdependent strategy: an optimal (v, i, j) is found, splitting the problem into a V problem and a smaller wedge problem.
• A V problem consists of finding the optimal alignment of an unbifurcated CM subgraph to a noncontiguous, two-piece sequence x g ..x h //x p ..x g , exclusive of the residues x h and x p (open circles in Figure 5). State r can be a start state or any state in a split set; the same is true for z. A V problem is solved by a split set-dependent strategy: an optimal (v, i, j) is found, splitting the problem into two V problems.
The three recursive splitting algorithms to solve these problems are as follows: v ← lowest numbered bifurcation state in subgraph . w,y ← left and right S children of v. β v ← OUTSIDE(r,w; g,q) α w ← INSIDE(w,y-1; g,q) α y ← INSIDE(y,z; g,q) T 1 ← V_SPLITTER(r,v; g,i; j,q) T 2 ← GENERIC_SPLITTER(w,y-1; i,k) T 3 ← GENERIC_SPLITTER(y,z; k+l,j) Attach S state w of T 2 as left child of B state v in T 1 . Attach S state y of T 3 as right child of B state v in T 2 . return T 1 .

The vinside and voutside routines
The VINSIDE and VOUTSIDE routines are just INSIDE and OUTSIDE, modified to deal with a two-piece subsequence x g ..x h //x p ..x g instead of a contiguous sequence x g ..x q . These modifications are fairly obvious. The range of i, j is restricted so that i ≤ h and j ≥ p. Also, VINSIDE (w, z; g, h; p, q) initializes α z (h, p) = 0: that is, we know that sequence x h ..x p has already been accounted for by a CM parse tree rooted at z.

Implementation
In the description of the algorithms above, some technical detail has been omitted -in particular, a detailed description of efficient initialization steps, and details of how the the dynamic programming matrices are laid out in memory. These details are not necessary for a high level understanding of the divide and conquer algorithm. However, they may be necessary for reproducing a working implementation. Commented ANSI/C source code for a reference implementation is therefore freely available at [http://www.genetics.wustl.edu/eddy/infernal/] under a GNU General Public License. This code has been tested on GNU/Linux platforms.
In this codebase, the CM data structure is defined in structs.h. The CM construction procedure is in modelmaker. c:Handmodelmaker(). The guide tree is constructed in HandModelmaker(). A CM is constructed from the guide tree by cm_from_master(). Individual parse trees are constructed using the guide tree by transmogrify(). The divide and conquer algorithm is implemented in smallcyk. c:CYKDivideAndConquer(), which will recursively call a set of functions: the three splitting routines GENERIC_SPLITTER (

Memory complexity analysis
The memory complexity of normal CYK/inside is O(N 2 M), for a model of M states and a query sequence of N residues, since the full 3D dynamic programming matrix is indexed N × N × M (and since N ∝ M, we can alternatively state the upper bound as O(N 3 )). The memory complexity of the divide and conquer algorithm is O(N 2 log M). The analysis that leads to this conclusion is as follows.
For a model with no bifurcations, the divide and conquer algorithm will never require more than 10 decks in memory at once. In the case of two adjacent MATP nodes, we will need six decks to store the scores for the current node we're calculating, and four decks for the split set of the adjacent node that we're connecting to (and dependent upon) (Figure 3).
Bifurcations will require some number of additional decks for start states (BEGL_S and BEGR_S) to be kept. In INSIDE, whenever we reach a deck for a start state, we will keep that deck in memory until we reach the parent bifurcation state. Half the time, that will mean waiting until another complete subgraph of the model is calculated (e.g. the subgraph rooted at the other start child of that bifurcation); that is, to calculate deck α v for a bifurcation v, we need both decks α w and α y for its child start states w and y, so we have to hold on to α y until we reach α w . In turn, the subgraph rooted at w might contain bifurcations, so our calculation of α w might require additional decks to be kept. Each start deck we reach in the INSIDE iteration means holding one extra deck in memory, and each bifurcation we reach means deallocating the two start decks it depends on; therefore we can iteratively calculate the maximum number of extra decks we will require: This number depends on the topology and order of evaluation of the states in the CM. Think of the bifurcating structure of the CM as a binary tree numbered in preorder traversal (e.g. left children are visited first, and have lower indices than right children). If this is a complete balanced tree with B bifurcations, we will need log 2 B extra decks. If it is a maximally unbalanced tree in which bifurcations only occur in left children, we will need B extra decks (all the right children). If it is a maximally unbalanced tree in which bifurcations only occur in right children, we will only ever need 1 extra deck. A left-unbalanced binary tree can be converted to a right-unbalanced binary tree just by swapping branches. For a CM, we can't swap branches without affecting the order of the sequence that's generated. We can, however, get the same effect by renumbering the CM states in a modified preorder traversal. Instead of always visiting the left subtree first, we visit the best subtree first, where "best" means the choice that will optimize memory usage. This reordering is readily calculated in O(M) time (not shown; see cm. c:CMRebalance() in the implementation). This way, we can never do worse than the balanced case, and we will often do better. We never need more than log 2 B extra decks. Since B <M, we can guarantee a O(N 2 log M) bound on memory complexity. An example of a pathological case is an RNA structure composed of a series of multifurcation loops such that each bifurcation leads to a small stem on one side, and the rest of the structure on the other. In such a case, every call to GENERIC_SPLITTER will split into a small subproblem containing a small stem (e.g. only removing a constant number of states and residues per split) and a large subproblem containing all the remaining states and sequence. This case can be avoided. It only arises because of the decision to implement a simplified CYK/OUTSIDE algorithm and always split at the highest bifurcations. Better time performance could be guaranteed if a complete CYK/ OUTSIDE algorithm were implemented (at the cost of complexity in the description and implementation of the algorithm). This would allow us to choose a split point in a generic problem at any state in the CM (for instance, in the middle) regardless of its bifurcating structure.

Time complexity analysis
In practice, empirical results on a variety of real RNAs (see below) indicate that the extra time required to do the divide and conquer is a small constant factor. A more complex implementation does not seem to be necessary.

Empirical results
Six structural RNA families were chosen for empirical evaluations of the algorithm, using available RNA database resources -tRNA [49], 5S ribosomal RNA [50], signal recognition particle (SRP) RNA [51], RNase P [52], small subunit (SSU) ribosomal RNA [53], and large subunit (LSU) ribosomal RNA [54]. For each family, a secondary structure annotated multiple alignment of four or five example sequences was extracted, and used to construct a CM. : : , , elsewise.
The top of Table 4 shows some statistics about these alignments and CMs. To determine the memory and CPU time requirements for a structural alignment, one example sequence from each family was aligned to the CM. CPU time was measured and memory requirements were calculated for three algorithms: (1) the full CYK/inside algorithm, but in memorysaving score-only mode (e.g. INSIDE(1,M; 1,L); (2) the full CYK/inside algorithm, with shadow matrix and traceback to recover the optimal alignment (e.g. INSIDE τ (1,M; 1,L)); and (3) the divide and conquer algorithm to recover an optimal alignment (e.g. GENERIC_SPLITTER (1,M; 1,L)). The most important comparison is between the full CYK/inside algorithm and the divide and conquer algorithm. The score-only CYK/inside algorithm was included, because a complete CYK alignment couldn't be done on SSU and LSU rRNA (because of the steep memory require-ment). In all cases where comparison could be done, the scores and alignments produced by these algorithms were verified to be identical. The same results are plotted in Figure 6. Memory requirements scale as expected: N 3 for standard CYK alignment, and better than N 2 log N for the divide and conquer algorithm. Empirical CPU time requirements scale similarly for the two algorithms (N 3.24 -N 3.29 ). The observed performance is better than the theoretical worst case of O (N 4 ). The proportion of extra time required by divide and conquer is roughly constant over a wide range of RNAs. The difference shown in Figure 6 is exaggerated because times are plotted for score-only CYK, not complete CYK alignment, in order to include CPU times for SSU and LSU rRNA. Because score-only CYK does not keep a shadow traceback matrix nor perform the traceback, it is about 20% faster than CYK alignment, as seen in the data in Table 4.

Conclusions
The divide and conquer algorithm described here makes it possible to align even the largest structural RNAs to secondary structure consensus models, without exceeding the available memory on current computational hardware. Optimal SSU and LSU rRNA structural alignments can be performed in 70 MB and 270 MB of memory, re- spectively. Previous structural alignment algorithms had to sacrifice mathematical optimality to achieve ribosomal RNA alignments.
The CPU time requirement of the alignment algorithm is still significant, and even prohibitive for certain important applications. However, CPU time is generally an easier issue to deal with. A variety of simple parallelization strategies are possible. Banded dynamic programming algorithms (e.g. calculating only relevant parts of the matrix) of various forms can also be explored, including not only heuristic schemes, but also optimal algorithms based on branch and bound ideas. (Properly implemented, banded DP algorithms would also save additional memory.) The algorithm takes advantage of the structure of covariance models (profile SCFGs), splitting in the dimension of the states of the model rather than in the sequence dimensions. The approach does not readily apply, therefore, to unprofiled SCFG applications in RNA secondary structure prediction [34,36,55,56], where the states are fewer and more fully interconnected. For these applications, it would seem to be necessary to divide and conquer in the sequence dimensions to obtain any significant improvement in memory requirements, and it is not immediately apparent how one would do this.
The current implementation of the algorithm is not biologically useful. It is meant only as a testbed for the algo-rithm. It outputs a raw traceback structure and alignment score, not a standardly formatted alignment file. Most importantly, the probability parameters for models are calculated in a very quick and simple minded fashion, and are far from being reasonable for producing robustly accurate structural alignments. The next step along this line is to produce good prior distributions for estimating better parameters, by estimating mixture Dirichlet priors from known RNA structures [57]. At this stage it would not be meaningful to compare the biological alignment accuracy of this implementation to (for instance) the excellent performance of the RAGA genetic algorithm [27]. A biologically useful implementation with accurate alignment performance is of course the eventual goal of this line of work, but is not the point of the present paper.

Figure 6
Empirical time and memory requirements for structural alignment. Plots of data from