Algebraic Dynamic Programming over general data structures
 Christian Höner zu Siederdissen^{1, 2, 4}Email author,
 Sonja J Prohaska^{3, 4} and
 Peter F Stadler^{1, 2, 4, 5, 6, 7, 8}
https://doi.org/10.1186/1471210516S19S2
© zu Siederdissen et al. 2015
Published: 16 December 2015
Abstract
Background
Dynamic programming algorithms provide exact solutions to many problems in computational biology, such as sequence alignment, RNA folding, hidden Markov models (HMMs), and scoring of phylogenetic trees. Structurally analogous algorithms compute optimal solutions, evaluate score distributions, and perform stochastic sampling. This is explained in the theory of Algebraic Dynamic Programming (ADP) by a strict separation of state space traversal (usually represented by a context free grammar), scoring (encoded as an algebra), and choice rule. A key ingredient in this theory is the use of yield parsers that operate on the ordered input data structure, usually strings or ordered trees. The computation of ensemble properties, such as a posteriori probabilities of HMMs or partition functions in RNA folding, requires the combination of two distinct, but intimately related algorithms, known as the inside and the outside recursion. Only the inside recursions are covered by the classical ADP theory.
Results
The ideas of ADP are generalized to a much wider scope of data structures by relaxing the concept of parsing. This allows us to formalize the conceptual complementarity of inside and outside variables in a natural way. We demonstrate that outside recursions are generically derivable from inside decomposition schemes. In addition to rephrasing the wellknown algorithms for HMMs, pairwise sequence alignment, and RNA folding we show how the TSP and the shortest Hamiltonian path problem can be implemented efficiently in the extended ADP framework. As a showcase application we investigate the ancient evolution of HOX gene clusters in terms of shortest Hamiltonian paths.
Conclusions
The generalized ADP framework presented here greatly facilitates the development and implementation of dynamic programming algorithms for a wide spectrum of applications.
Keywords
Background
Dynamic Programming (DP) over rich index sets provides solutions of a surprising number of combinatorial optimization problems. Even for NPhard problems such as the Travelling Salesman Problem (TSP) exact solutions can be obtained for moderate size problems of practical interest. The corresponding algorithms, however, are usually specialized and use specific properties of the problem in an ad hoc manner that does not generalize particularly well.
Algebraic dynamic programming (ADP) [1] defines a highlevel descriptive domainspecific language for dynamic programs over sequence data. The ADP framework allows extremely fast development even of quite complex algorithms by rigorously separating the traversal of the state space (by means of context free grammars, CFGs), scoring (in terms of suitable algebras), and selection of desired solutions. The use of CFGs to specify the state space is a particular strength of ADP since it allows the user to avoid indices and control structures altogether, thereby bypassing many of the pitfalls (and bugs) of usual implementations. Newer dialects of ADP [2, 3] provide implementations with a running time performance close to what can be achieved by extensively handoptimized versions, while still preserving most of the succinctness and highlevel benefits of the original ADP language.
Sequence data is not the only type of data for which grammarlike dynamic programs are of interest. Inverse coupled rewrite systems (ICOREs) [4] allow the user to develop algorithms over both, sequence and treelike data. While no implementation for these rewrite systems is available yet, they already simplify the initial development of algorithms. This is important in particular for treelike data. Their nonsequential nature considerably complicates these algorithms. The grammar underlying the alignment of ncRNA family models with CMCompare [5], which simultaneously recurses over two trees, may serve as an example for the practical complications. There are compelling reasons to use DP approaches in particular when more information than just a single optimal solution is of interest. DP over sequences and trees readily allows the enumeration of all optimal solutions, and it offers generic ways to systematically investigate suboptimal solutions and to compute the probabilities of certain subsolutions. Classified dynamic programming [6], furthermore, enables the simultaneous calculation of solutions with different class features via the evaluation algebra instead of constructing different grammars for each class.
An important research goal in the area of dynamic programming algorithms is the development of a framework that makes it easy to implement complex dynamic programs by combining small, simple, and reusable components. A first step in this direction was the introduction of grammar products [7], which greatly simplifies the specification of algorithms for sequence alignments and related dynamic programming tasks that take multiple strings as input. Several straightforward questions, however, still remain unanswered.
An important example is the relationship of Forward/Backward (in the context of linear grammars) [8] and Inside/Outside (in the context of CFGs) [9] algorithms. So far, the two variants need to be developed and implemented independently of each other. The close structural relationship of the two types of recursion has of course been noticed and used explicitly to facilitate algorithm design. The idea of "reverting" the inside production rules was used explicitly to explain backtracing and outside recursions in [10, 11] for the RNARNA interaction problem and in [12] for RNA folding with pseudoknots, albeit without providing a general operational framework. In classical ADP the Inside algorithms are phrased as parsing an input string w.r.t. a given context free grammar. This is not possible in general for the Outside recursion because these operate, conceptually, on the complement of a substring. In some situations it is possible to rescue the ADPstyle approach. For RNA folding, for example, Janssen [13] proposed to concatenate the suffix and the prefix in this order. The Outside recursion is then rephrased as a CFG on this modified string.
since the shortest path through A to i must consist of a shortest path through A ending in some j ∈ A and a final step from j to i.
A classical ADP formulation is impossible because the set A does not admit a string representation so that its subsets could be generated by a fixed set of productions. To split off a particular element {i} from A, for example, one requires a specific production rule of the form A → (A \ {i}) ∪ {i}. This cannot be captured by a fixed CFG since the number of productions grows with the size of A.
Instead of relaxing the constraints on the number of productions we argue here that the solution to this conundrum can be resolved by a redefinition of the concept of parsing so that we can meaningfully write A → Ax for the decomposition of a (nonempty) set into a subset with cardinality one less and the excluded single element. This restores one of the main advantages of ADP, namely the possibility to describe the state space traversal without explicit representation of indices. At the same time we will see below that the same formalism also yields a completely mechanical way to construct Outside recursions from the Inside algorithm. To this end we first consider the conceptually simple case of 1dimensional and 2dimensional linear grammars on strings using HMMs and pairwise sequence alignments as example. We then proceed to RNA folding as an example of a nontrivial CFG. The final step is to introduce an ADPstyle formalism for nontrivial setlike data structures. Up to this point we keep our discussion informal and ignore several technical details. In section Theory we will then follow up with a much more abstract and precise account. Finally, we consider the probabilistic version of the shortest Hamiltonian path problem in the context of the early evolution of HOX gene clusters as a reallife application of our framework.
Case studies
HMMs and the Forward/Backward algorithms
This allows to compute $\mathbb{P}\left(i\in +\right)={f}_{+}\left[i\right]{b}_{+}\left[i\right]$ and $\mathbb{P}\left(i\in \right)={f}_{}\left[i\right]{b}_{}\left[i\right]$.
where ${c}_{i}^{P\to {P}_{c}}:=\left(1{q}^{\pm}\right){a}_{{x}_{i},{x}_{i1}}^{+}$, etc., are the tabulated parameters of the HMM. The initialization P_{0} = M_{0} = 1 follows as the conditional probability of ending in the ϵ state after having read all input. Note that the strucuture of the recursion (5) is completely determined by the productions in equ.(4).
The backward recursion corresponds to a traversal of the input by means of suffixes. To each forward prefix [1..i] we have a matching sufix [i..n], where n is the length of the input. This overlap of corresponding prefix and sufix is just one indication that we might want to modify how we interpret the grammar (4). The fact that the scoring function explicitly refers to transitions, i.e., pairs of consecutive positions gives another hint. In this alternative picture we think of P and M as prefixes in which the last position takes on the role of a boundary ∂M and ∂P. Now we can think of the corresponding sufixes as the complements w.r.t. the input, i.e., to "forward objects" P and M we associate "backward objects" P^{*} and M^{*} so that, in terms of the index sets to which they refer, we have $P\cup {P}^{*}=S=\left\{1\dots n\right\},M\cup {M}^{*}=S,\phantom{\rule{2.36043pt}{0ex}}P\cap {P}^{*}=\partial P$, and $M\cap {M}^{*}=\partial M$. Correspondingly, the terminals can be thought of as pairs of consecutive positions. This provides us with a mechanical way of scoring c as ${c}_{i}^{\pm}$ in the forward recursion and as ${c}_{i+1}^{\pm}$ in the backward recursion, since the terminal c defined for positions (i  1; i) overlaps with the boundary of the forward objects P and M at i  1, while the one defined on (i, i + 1) overlaps on i + 1 with the boundary of the backward objects P^{*} and M*.
The a posteriori probability that sequence position i is in the + state is given by ${P}_{i}{P}_{i}^{*}$. In our framework, we obtain the complete list of these proabilities by using the probability scoring algebra and the formal production rule S → P*P or, equivalently, S → PP*. Writing S → P*P as a production rule from the start symbol ties P and P^{*} together to be complementary, rather than independent nonterminals following the forward and backward grammar. Note that symbolically S → P*P is no longer a linear production. It becomes useful, however, in our formalism to use the notation of production rules to specify any kind of decomposition of a data object. In this setting S → P*P does makes sense: it defines the list of all complementary pairs of inside and outside objects, i.e., it serves as implicit specification of the outside object P^{*} to P. We will formally define this construct in the theory section.
Prefix and sufix style sairwise sequence alignments
As for the HMM we think of A^{*} as a representation of the hole that is left over by A, and we read ${A}^{*}\left(\begin{array}{c}\hfill u\hfill \\ \hfill \hfill \end{array}\right)$ as "fill $\left(\begin{array}{c}\hfill u\hfill \\ \hfill \hfill \end{array}\right)$ into the hole A^{∗} at its r.h.s. end. Clearly S → AA^{∗} and S → A^{∗}A refer to the complete global alignments with all possible "splitting constraints".
The outside productions (9) look like the suffix version of the NW algorithm. Writing the decomposition with full index information, however, shows that there is a suble difference: ${A}_{ij}\mapsto {A}_{i1\phantom{\rule{2.36043pt}{0ex}}j1}\left(\begin{array}{c}\hfill i\hfill \\ \hfill j\hfill \end{array}\right)$ transforms to ${A}_{ij}^{*}\mapsto {A}_{i+1,j+1}^{*}\left(\begin{array}{c}i+1\\ j+1\end{array}\right)$, etc. This highlights the interpretation that A^{*} refers to the "hole" extending to the right from the positions after i and j, i.e., not including i and j itself. While this make little formal difference for the NW algorithm, it does have an important impact in the more complex case of Gotoh's algorithm for affine gap costs [18].
where u and v denote terminal symbols. '−' corresponds to gap opening, while '.' denotes the (differently scored) gap extension.
The interpretation of the nonterminals M, D, and I is determined by the last column of the prefix alignment: it ends in a (mis)match, a deletion, or an insertion, respectively. In contrast to the HMM example of the previous section we do not score transitions here. Thus we interpret the nonterminals as boundaryfree. Hence M^{*} becomes a suffix object complementing a prefix alignment that ends in a (mis)match. Note that this does not mean that M^{∗} itself ends in a (mis)match. Because M , and thus M^{∗} are boundaryfree, the corresponding alignments do not overlap. As a consequence, their scores can be added. This property is required for the evaluation algebras to behave properly.
A first glance, this grammar looks odd. It is not the grammar for the suffixversion of Gotoh's algorithm. Instead, it refers to a rather unusual way of solving the affine gap cost problem. Here the distinction is not made between opening or extending a gap, but rather between closing or extending it. The nonterminals on the r.h.s. of the rule thus refer to the type of alignment that is reached after extending the one on the l.h.s. of the rule by the terminal symbol appearing on the r.h.s. Since our forward recursion (10) is set up to separately score gap opening, i.e., the leftmost gapped position in the alignment, the same must be true for the backward recursion. Since it proceeds from right to left on the input string, we naturally arrive at the algorithmic variant that scores gap closing separately. The corresponding nonterminals therefore depend on how the alignment is continued in the subsequent step.
On a more technical note, the conversion of (10) to (11) does not break the signature type isomorphism between inside and outside variant. Where previously the rule $D\to I\left(\begin{array}{c}\hfill u\hfill \\ \hfill \hfill \end{array}\right)$ makes use of an attribute function with type $\Gamma \times \left(\begin{array}{c}\hfill Char\hfill \\ \hfill \hfill \end{array}\right)\to \Gamma $ for evaluation, this type now corresponds to the rule ${I}^{*}\to {D}^{*}\left(\begin{array}{c}\hfill u\hfill \\ \hfill \hfill \end{array}\right)$. Here Γ is the type of the evaluated parse, e.g. a probability for SCFGs or an energy or a partition function for RNA folding.
One obtains the probability of matching each pair $\left(\begin{array}{c}\hfill i\hfill \\ \hfill j\hfill \end{array}\right)$ via S → MM^{∗} ≡ M^{∗}M. Each M_{ ij } indicates an alignment ending in a match with indices $\left(\begin{array}{c}\hfill i\hfill \\ \hfill j\hfill \end{array}\right)$, while ${M}_{ij}^{*}$ yields alignments where the matching characters at $\left(\begin{array}{c}\hfill i\hfill \\ \hfill j\hfill \end{array}\right)$ "have just been transitioned from".
Inside and outside: RNA folding
The indices here explicitly designate a substring of the input to which a particular nonterminal or terminal symbol refers.
The nonterminal S, which refers to unconstrained secondary structures, has an empty boundary. In contrast, it is natural to think about B_{ ij } as having the closing base pair ⟨i, j⟩ as its boundary. The reason is that the standard energy model for RNAs in general evaluates the "loop" enclosed by ⟨i, j⟩ rather than the pair itself. This is also true for corresponding outside objects, i.e., (i, j) naturally contributes both inside and outside "loops" that it delimits. In the natural way to define outside objects this is different for S and B. The complement of S_{ ij } is ${S}_{i1,j+1}^{*}$ and refers to secondary structures on [1..i  1] ∪ [j + 1..n]. In contrast, the complement of B_{ ij } is ${B}_{ij}^{*}$, referring to secondary structures on the union [1..i] ∪ [j..n]. Thus S ∩ S^{∗} = ∅, while B ∩ B^{*} is the common enclosing base pair.
We can now derive McCaskill's algorithm [21] for computing the base pairing probabilities by (1) writing down the inside grammar (see e.g. [22] for several variants), (2) specifying the evaluation algebra for the partition function (see theory section), (3) generating the outside recursions, and (4) producing the list S → BB^{∗}.
Shortest Hamiltonian paths
We now leave the realm of classical ADP behind and consider dynamic programming algorithms on unordered data structures. This most clearly requires us to rethink what we mean by a "grammar" and by "parsing". Since shortest Hamiltonian paths play a key role in the showcase example later on we introduce them here as an illustrative example.
SHORTEST HAMILTONIAN PATH (SHP). Given a graph G with vertex set V and edge set E and a dissimilarity matrix $d:E\to {\mathbb{R}}_{+}$ the task is to find a path π in V that runs through every vertex and minimizes the total length $\ell \left(\pi \right)={\sum}_{i=1}^{n1}{d}_{\pi \left(i\right)\pi \left(i+1\right)}.$.
in complete analogy to a linear grammar. The point of this section is that we can make this analogy precise and useful.
Let us first consider this rule for an arbitrary set. Then A → Av tells us to split off a single element (atom) from A. On string data structures there is essentially only a single way of doing this, namely to remove a singlecharacter suffix. Removal of a prefix would be encoded by A → vA, i.e. a distinct production. On sets, we now have A possibilities, i.e., we obtain a list of possible decompositions. Since the underlying data structure has no intrinsic order, the productions A → Av and A → vA are of course equivalent. This is not different from CFGs, in fact: A production of the form A → BC returns A + 1 partitionings of A into a prefix B and a suffix C. Of course A → BC also makes perfect sense for sets: B and C now form the bipartitions of A, i.e., there are 2^{ A } alternative decompositions. The only difference between strings and sets thus is the number of alternative parses.
In the SHP example there is a further complication: we have to keep track of the end points i and j. Instead of regular sets, we thus have an additional "punctuation" structure that defines a start and end point. The parsing rule A → Av now has to know that (i) the end point has to be split off, and in doing so, (ii) a new endpoint distinct from the startpoint has to be determined so that we obtain again a properly punctuated set. Furthermore it becomes the parser's job to know that (iii) the terminal v is the connection between new and old endpoint, rather than just a splitoff vertex. We note in passing that simply reinterpreting the punctuated sets A as connected components so that neither end point is a cut vertex of the input graph may increase practical efficiency since it prunes early those subsets through which no path connecting the end points can be constructed. Whether we want to think of the startand endpoints as distinct features, or whether either one can be used to decompose the set, is a matter of modelling and defines two distinct data types.
and assume that our machinery knows that V is an unstructured set, while A is a set with start and endpoint. The triviallooking rule therefore provides a list of V(V − 1) or V(V − 1)/2 punctuated sets, depending on whether start and end are distinguished or not.
as the corresponding grammar for the outside objects.
in the indexbased notation. Similarly, P (ends=p, q) = Z(S_{ pq } )/Z(S) provides us with the probabilities that the path ends in p and q and $P\left(\mathsf{\text{end}}=\mathsf{\text{p}}\right)={\sum}_{q}P\left(\mathsf{\text{ends}}=p,q\right)$ measures how frequently we expect p to be an end point of a path. We make use here of the distinction between the start symbol S, which refers to a set without boundary, and S_{ pq } , a set [p, A, q] with two boundary points p and q defining the end points of the paths running through it. Thus S → S_{ pq } with "sum" as choice function and the identity attribute function (attribute functions of an algebra evaluate individual parses) yields Z(S), summing over all (p, q). On the other hand, S → S_{ pq } with the identity choice function and attribute function $\lambda z.\frac{z}{Z\left(S\right)}$ (where $\lambda z.\mathcal{B}$ denotes an anonymous function with body $\mathcal{B}$ expecting z as its single argument) returns a list of all (p, q) start/end points, together with the probabilities that these points are start and end point. To obtain the end probabilities in our framework we need an additional type of nonterminals, say S_{ p }, that have only a single point as boundary, i.e., it refers to sets of the form [p, A]. The production S → S_{ p } with the identity choice and attribute $\lambda z.\frac{z}{Z\left(S\right)}$ then yields the desired end probabilities. S_{ p } → S_{ pq } folds over all S_{ ip } and S_{ pi } for all i, as we now do not distinguish between a 'starting' and 'ending' point in a path for S_{ p }.
The formal framework
The key ingredient in our approach is to generalize grammars to decomposition schemes of a wide range of data models by redefining what exactly parsers do. Let us start with making explicit how this works in classical ADP, i.e., for (context free) grammars on strings:
1 Each (non)terminal corresponds to a substring.
2 Each terminal symbol matches a single character of the input string.
3 Each production defines a (list of) partitions. More precisely, the substrings corresponding to the r.h.s. of the production partition the substring corresponding to the single nonterminal on its l.h.s.
4 The partition is order preserving, i.e., the sequence of symbols on the r.h.s. matches the order of the corresponding substrings on the input string.
We have seen that it may be convenient already in the realm of strings to give up some of these requirements e.g. to treat problems in which terminals are naturally interpreted as transitions between adjacent positions as in the HMM case.
for which we require the following properties:
(C1) ${\bigcup}_{i}{A}_{i}=A$, i.e., the decomposition products of A form a covering of A.
(C2) int A_{ i } ∩ int A_{ j } ≠ ∅ implies i = j, i.e., the interiors of the parts are disjoint.
(C3) int A_{ i } ⊆ int A, i.e., the interiors behave like isotonic functions.
Note that these axioms recover the partitionstyle parsing if all data objects have empty boundaries. In this case (C3) follows from (C2). Whether we treat the ${\bigcup}_{i}\left[{A}_{i},\phantom{\rule{2.36043pt}{0ex}}\partial {A}_{i}\right]$ as an ordered list or as a multiset (or as something inbetween) depends on the intrinsic internal order structure of A. Here we have encountered only total orders (on strings) and antichains (for sets). Nontrivial partial orders however may become important when dealing with tree structures.
The parsers can infer much of the necessary index handling from considering metarules for handling adjacent boundaries. For instance, it will be useful in many cases to declare that boundaries of adjacent objects can overlap only if they coincide. In the RNA example enclosing base pairs appear as object boundaries. Semantically, it make no sense to allow overlap of base pairs at one end but not at the other. In other cases, however, it is useful to require only that ∂A_{ i } ⊆ A_{ j } or vice versa. This is the case in the HMM and SHP example, where we might want to interpret the terminals as transitions and edges as having empty interior and thus consisting of boundary only. DP algorithms where the index arithmetic in the decompositions is even more complex for instance appear in RNAwolf [23] and in the context of the coloring problems associated with RNA design in [24].
where γ is a set of terminals and nonterminals. By construction there is exactly one syntactic outside variable on the r.h.s. of an outside production rule. All other symbols on the r.h.s. are either terminal symbols or inside symbols. From the perspective of the outside variables, they behave as "syntactic terminals", i.e., in a combined inside/outside grammar none of their derivations ever reaches an outside variable. As an immediate consequence we conclude that the outside grammar is bilinear (and even linear in the unordered case) in its outside syntactic variables. Given a description of parsing we can now use the conventional ADP framework.
Start symbols, stop symbols, and normalized grammars
 (1)
Every production rule S → α with the start symbol S on the l.h.s. has a single nonterminal (or syntactic variable) on the r.h.s.
 (2)
All production rules with only terminals on the r.h.s. have just a single ε (and no other symbol) on the r.h.s.
While it is not strictly necessary to work with normalized grammars, they are practically convenient because normalization guarantees that start and stop rules that isomorphic evaluation function types in their respective inside and outside version. It is easy to see that if $\mathcal{G}$ is normalized, then its outside variant ${\mathcal{G}}^{*}$ is also normalized.
In the context of multidimensional grammars for alignments we have to deal with gap symbols referring to an empty input on one or more input tapes. Although gaps are superficially similar to stop symbols they only appear in the context of actually parsing an input symbol, albeit on another tape, they are handled just like any other characterparsing terminal. In particular they do not give rise to a start symbol in the outside grammar.
Combining Inside and Outside variables: a posteriori probabilities
ADP grammars come with a signature that describes the types of the attribute functions attached to each production rule. One of the fringe benefits of constructing the outside grammar automatically according to equ.(15) or equ.(25) is that the inside and outside grammars are guaranteed to be isomorphic with respect to their signature. This, in turn, simplifies reuse of evaluation algebras between inside and outside.
The formal production S → P^{∗}P ≡ PP^{∗}, as an inside rule, states that parses p^{∗} should be combined with corresponding parses p. This is, however not a normal contextfree rule. If P and its outside complement P^{∗} come from a linear grammar with a setbased index type, then the intuition is correct. For a linear grammar on a string index type, this looks intuitively correct, but the underlying index type does not admit a contextfree grammar at all as linear grammars have a fixed left or right end point for each subparse. This issue can be resolved by observing that S → PP^{∗} in an inside context translates into generating all possibilities of splitting (the index representation of) the complete input into an inside and an outside part. For linear string grammars there are the O(n) ways to split 0 ≤ k ≤ n at k; for string CFGs there are O(n^{2}) ways to split 0 ≤ k ≤ l ≤ n at (k, l), and linear set grammars yield O(2^{ n } ) different split points of a set with n bit. For multitape grammars, the behaviour follows in an analog fashion.
The production S → PP^{∗} requires an attribute function evaluating each parse (p, p^{∗}) ∈ P, P^{∗}, and a choice function. The evaluation algebra for probabilities or partition functions (which are essentially unnormalized probabilities) comprises multiplication for terms appearing in decompositions and addition for alternative productions from the same nonterminal. For the terminals, score values are tabulated as parameters. In the case of RNA folding, these are the Boltzmann factors exp(−E(t)/RT ) of the energies E(t) associated with the terminal t. For the RNA toy model of Sec. we have E(t) = 1 for a base pair terminal and E(t) = 0 for an unpaired terminal. In the more realistic setting, the loop energies of the Turner model are used. The practical evaluation will typically be along the lines of $\lambda pq.\frac{p\times q}{Z}$ to yield the probability, where the normalization constant Z is obtained by evaluating the start nonterminal.
Application: shortest Hamiltonian paths and gene cluster histories
Local duplication of DNA segments via unequal crossover is the most plausible mechanism for the emergence and expansion of local clusters of evolutionary related genes. Although there are polynomialtime algorithms to reconstruct duplication trees from pairwise evolutionary distance data [25] this approach often fails to resolve the ancient history of gene clusters. The reason is the limited amount of phylogenetic information in a single gene. The situation is often aggravated by the extreme time scales leading to a decay of the phylogenetic signal so that only a few, very wellconserved sequence domains can be compared. A large number of trees then fits the data almost equally well. A meaningful analysis of the phylogeny thus must resort to some form of summary information that is less detailed than a fully resolved duplication tree. In the absence of genome rearrangements, and if duplication events are restricted to copying single genes to adjacent positions, we expect genetic distance to vary monotonically with genomic distance, i.e., we expect  at least approximately  to have d_{ ik } ≥ max(d_{ ij }, d_{ jk } ) whenever gene j lies between i and k on the genome. The same is true if gene duplications arise by unequal crossover and subsequent divergence rates are comparable. This socalled Robinson property ensures that a shortest Hamiltonian path through the genetic distance matrix conforms to the linear arrangement of the genes on the genome [26]. A mathematically more precise exposition of the role of short Hamiltonian paths in clusters of paralogous genes can be found in a forthcoming manuscript [27].
The same high noise level that suggests to avoid duplication trees should also make us distrust the shortest path. More robust results can be expected by considering the information on the ensemble of all Hamiltonian paths. We therefore compute the probabilities P (i ~ j) of the individual adjacencies assuming a Boltzmann weighting p(π) ∝ exp(−ℓ(π)/RT ) of the Hamiltonian paths π. The parameter T is a fictitious temperature governing the relative importance of short versus long paths π. For T → 0 we focus on the (co)optimal paths only, while T → ∞ leads to a uniform distribution of adjacencies. The normalization constant is conveniently set to $R=\left(n1\right)\stackrel{\u0304}{d}$, where $\stackrel{\u0304}{d}$ is the average of the genetic distance between genes. The path length ℓ(π) plays the role of the energy in the partition function of RNA secondary structures and of the dissimilarity score in probabilistic alignment algorithms. As we have seen above, the The ADPstyle framework provides us with an easy and efficient way to compute the probabilites P (i ~ j) of adjacencies along short Hamiltonian paths and the probabilities P (end = i) that gene i is the endpoint of a short path. In intact clusters we expect that the ends of genomic cluster also appear as the most probable ends of the Hamiltonian paths. High probabilities in the interior, by contrast, are a good indicator of rearrangments.
Hox genes are ancient regulators originating from a single Hox gene in the metazoan ancestor. Over the course of animal evolution the Hox cluster gradually expanded to 14 genes in the vertebrate ancestor [28]. Timing and positioning of Hox gene expression along the body axis of an embryo is colinear with the genomic arrangement in most species. Only the 60 amino acids of the socalled homeodomain can be reliably compared at the extreme evolutionary distances involved in the evolution of the Hox system. We quantitatively measure the genetic distance of the homeodomain sequences either using the Hamming distance, i.e. the number of different aminoacids, or the transformation dab = s(a, a) + s(b, b) − 2s(a, b) of the BLOSSUM45 similarity scores.
Discussion
We have taken here the first step towards extending algebraic dynamic programming (ADP) beyond the realm of stringlike data structures. Our focus is an efficient, yet notationally friendly way to treat DP on unordered sets. As a showcase application we used ADP on sets to demonstrate that statistics over Hamiltonian paths can be computed efficiently as means of analyzing the ancient evolution of gene clusters. This extension of ADP builds on the same foundation (namely ADPfusion [2]) as our grammar product formalism [7, 30]. The key idea is to redefine the rules of parsing to match the natural subdivisions of the data type that now may be much more general than strings. In the case of sets, these are bipartitions and the splitting of individual elements, rather than the subdivision of an interval or the removal of a boundary element that are at the heart of string grammars. A particularly useful feature of our work is the ADPstyle implementation and a principled approach to constructing outside algorithms, which is a rather straightforward consequence of defining the complement of a substructure relative to the input data object. There are several advantages to this approach:

One cannot forget contributions to outside recursions. Such missing rules render the algorithm invalid, sometimes in nonobvious ways. This is of particular relevance for complex grammars and when existing algorithms are to be modified.

Together with the ADPfusion framework the most annoying type of bugs in practical implementations of DP, namely index errors, can be avoided altogether because all index arithmetic is implicit and hidden completely from the user.

Outside grammar construction is independent of syntactic variable and terminal types. As long as the abstract grammar is a contextfree grammar, an outside version can be constructed.

Our mechanistic construction interacts smoothly with other systems that automate creation of formal grammars, e.g. grammar products [7].
Our current framework still lacks generality and completeness in several respects. It is evident from our example above that data objects of different types can be obtained in the decomposition. For these, parsing may then mean different, typedependent things. For instance, in the context of forest alignment and forest editing, reviewed in [31], it may be useful to distinguish trees from general forests. This suggests the possibility to develop an algebraic formalism of parsing/decomposition for complex data objects and thus an even higherlevel way of specifying the intricacies of parsing schemes underlying DP algorithms. McBride's notion of a derivative operator acting on data types [32] appears to be a relevant starting point in this direction, although it does not seem to be directly applicable.
Although our present framework requires that parsing methods have to be specified for novel data types such as the punctuated sets used in the context of Hamiltonian paths, this has to be done only once and can reused without additional overhead for all DP scenarios on the same data types. In particular, our system already handles all CFGs (and thereby also all linear grammars) on either strings or (punctuated) sets and automatically provides the associated outside algorithms. The highlevel framework described here does not require much of a compromise in terms of computational efficiency. While we have to accept a decrease in theoretical performance by a moderate constant factor the gains in ease of algorithm design and actual software development are well worth this price. In the ADPfusion framework we currently have to accomodate approximately a doubling of the running time compared to expertoptimized implementations. Conceptually, the framework extends to multicontextfree grammars (MCFGs) and thus holds promise to drastically simplify the implementation of algorithms for RNA folding with pseudoknots and complex RNARNA interaction structures. Ongoing work in this area aims at formalizing MCFGADP theory [33] and the efficient implementation of the necessary parsers in ADPfusion [34].
Availability
The algorithms described in this work are part of the generalized ADP framework. Available here: http://www.bioinf.unileipzig.de/Software/gADP/
Declarations
Acknowledgements
This work was funded, in part, by the Austrian FWF, project "SFB F43 RNA regulation of the transcriptome", the Templeton Foundation, grant # 24332 "Origins and Evolution of Regulation in Biological Systems", and the DFG project "MI439/141". Publication charges for this article have been funded by the University Leipzig and the Deutsche Forschungsgemeinschaft (DFG).
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 19, 2015: Brazilian Symposium on Bioinformatics 2014. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S19
Authors’ Affiliations
References
 Giegerich R, Meyer C: Algebraic dynamic programming. Algebraic Methodology And Software Technology. Lect. Notes Comp. Sci. Edited by: Kirchner, H., Ringeissen, C. 2002, Springer, Berlin, Heidelberg, 2422: 349364.View ArticleGoogle Scholar
 Höner zu Siederdissen C: Sneaking around concatMap: efficient combinators for dynamic programming. Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP'12). 2012, ACM, New York, 215226.View ArticleGoogle Scholar
 Sauthoff G, Janssen S, Giegerich R: Bellman's GAP  a declarative language for dynamic programming. Proceedings of the 13th International ACM SIGPLAN Symposium on Principles and Practices of Declarative Programming (PPDP'11). 2011, ACM, New York, 2940.Google Scholar
 Giegerich R, Touzet H: Modeling dynamic programming problems over sequences and trees with inverse coupled rewrite systems. Algorithms. 2014, 62144.Google Scholar
 Höner zu Siederdissen C, Hofacker IL: Discriminatory power of RNA family models. Bioinformatics. 2010, 26: 453459.View ArticleGoogle Scholar
 Voß B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biology. 2006, 4: 5PubMedView ArticlePubMed CentralGoogle Scholar
 Höner zu Siederdissen C, Hofacker IL, Stadler PF: Product grammars for alignment and folding. IEEE/ACM Trans. Comp. Biol. Bioinf. 2014, 99:Google Scholar
 Rabiner LR: A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE. 1989, 77: 257286.View ArticleGoogle Scholar
 Baker JK: Trainable grammars for speech recognition. J. Acoust. Soc. Am. 1979, 65: 132View ArticleGoogle Scholar
 Huang FWD, Qin J, Reidys CM, Stadler PF: Partition function and base pairing probabilities for RNARNA interaction prediction. Bioinformatics. 2009, 25: 26462654.PubMedView ArticleGoogle Scholar
 Huang FWD, Qin J, Reidys CM, Stadler PF: Target prediction and a statistical sampling algorithm for RNARNA interaction. Bioinformatics. 2010, 26: 175181.PubMedView ArticlePubMed CentralGoogle Scholar
 Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27: 10761085. Addendum in: Bioinformatics 28:300 (2012)PubMedView ArticleGoogle Scholar
 Janssen S: Kisses, ambivalent models and more: Contributions to the analysis of RNA secondary structure. PhD thesis, Univ. Bielefeld. 2014, urn: nbn:de:hbz:36126821318Google Scholar
 Bellman R: Dynamic programming treatment of the travelling salesman problem. J. ACM. 1962, 9: 6163.View ArticleGoogle Scholar
 Held M, Karp RM: A dynamic programming approach to sequencing problems. J. SIAM. 1962, 10: 196201.Google Scholar
 Durbin R, Eddy SR, Krogh AGM: Biological Sequence Analysis. 1998, Cambridge University Press, CambridgeView ArticleGoogle Scholar
 Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 443453. 48Google Scholar
 Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J. theor. Biol. 1986, 121: 327337.PubMedView ArticleGoogle Scholar
 Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Alg. Mol. Biol. 2011, 6: 26View ArticleGoogle Scholar
 Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999, 49 (2): 145165.PubMedView ArticleGoogle Scholar
 McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 11051119.PubMedView ArticleGoogle Scholar
 Dowell RD, Eddy SR: Evaluation of several lightweight stochastic contextfree grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004, 5: 71PubMedView ArticlePubMed CentralGoogle Scholar
 Höner zu Siederdissen C, Berhart SH, Stadler PF, Hofacker IL: A folding algorithm for extended RNA secondary structures. Bioinformatics. 2011, 27: 129137.View ArticleGoogle Scholar
 Höner zu Siederdissen C, Hammer S, Abfalter I, Hofacker IL, Flamm C, Stadler PF: Computational design of RNAs with complex energy landscapes. Biopolymers. 2013, 99: 11241136.PubMedGoogle Scholar
 Elemento O, Gascuel O: An efficient and accurate distance based algorithm to reconstruct tandem duplication trees. Bioinformatics. 2002, 8 (Suppl 2): 9299.View ArticleGoogle Scholar
 Robinson WS: A method for chronologically ordering archaeological deposits. Amer. Antiquity. 1951, 16: 293301.View ArticleGoogle Scholar
 Prohaska SJ, Höner zu Siederdissen C, Stadler PF: Expansion of gene clusters and the shortest Hamiltonian path problem. 2015Google Scholar
 GarciaFernàndez J: Hox, parahox, protohox: facts and guesses. Heredity. 2005, 94: 145152.PubMedView ArticleGoogle Scholar
 Cameron RA, Rowen L, Nesbitt R, Bloom S, Rast JP, Berney K, ArenasMena C, Martinez P, Lucas S, Richardson PM, Davidson EH, Peterson KJ, Hood L: Unusual gene order and organization of the sea urchin Hox cluster. J Exp Zoolog B Mol Dev Evol. 2006, 306: 4558.View ArticleGoogle Scholar
 Höner zu Siederdissen C, Hofacker IL, Stadler PF: How to multiply Dynamic Programming algorithms. Brazilian Symposium on Bioinformatics (BSB 2013). Lect. Notes Bioinf. 2013, Springer, Heidelberg, 8213: 8293.Google Scholar
 Billie P: A survey on tree edit distance and related problems. Theor. Comp. Sci. 2005, 337: 217239.View ArticleGoogle Scholar
 McBride C: Clowns to the left of me, jokers to the right (pearl): dissecting data structures. 2008, 43: 287295.Google Scholar
 Riechert M: Algebraic dynamic programming for multiple contextfree languages. Master's thesis, Hochschule für Technik, Wirtschaft und Kultur, Leipzig. 2013Google Scholar
 Riechert M, Höner zu Siederdissen C, Stadler PF, Waldmann J: Algebraic dynamic programming for multiple contextfree languages. 2015Google Scholar
 Höner zu Siederdissen C, Prohaska SJ, Stadler PF: Dynamic programming for set data types. Advances in Bioinformatics and Computational Biology: BSB 2014. Lect. Notes Comp. Sci. Edited by: Campos, S. 2014, 8826: 5764.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.