This section deals with our novel graph-based approach to structural RNA alignment. We first give the problem definition and then describe the graph-theoretical model we use, which combines the models presented in [43] and [44]. We convert the nucleotides of the input sequences into vertices of a graph, and we add edges between the vertices that represent either structural information or possible alignments of pairs of nucleotides. Based on the graph model we develop an integer linear programming formulation. We find solutions using an algorithmic approach employing methods from combinatorial optimization. For sake of simplicity, we will limit the description to the two-sequence case. We want to stress, however, that the model can be extended to the multiple case without changing the core algorithms and ideas. The interested reader is referred to an extensive theoretical description including proofs and a computational complexity discussion appearing elsewhere [16].
2.1 Graph-theoretical model for structural RNA alignment
Problem definition
Given two RNA sequences, we denote by an alignment of the two sequences. Let s
S
() be the sequence score of alignment including gap penalties, and let s
P
() be the score of structural features that are conserved by the alignment . We now aim at maximizing the combined sequence-structure score, that is, we search for an alignment with
Figure 4 gives a toy example showing two annotated sequences and two possible alignments, one maximizing the score of sequence and structure, and the other one just the sequence score alone. This problem definition comprises the one addressed by Bafna et al. in [22]: Our model, however, also allows tertiary elements, which is not covered by their recursions.
Basic model
Let s = s1, ..., s
n
be a sequence of length n over the alphabet Σ = {A, C, G, U}. A pair (s
i
, s
j
) is called an interaction if i < j, and nucleotide i pairs with j. In most cases, these pairs will be Watson-Crick or wobble base pairs. The set p of interactions is called the annotation of sequence s. Two interactions (s
k
, s
l
) and (s
m
, s
o
) are said to be inconsistent, if they share one base; they form a pseudoknot if they "cross" each other, that is, if k <m <l <o or m <k <o <l. A pair (s, p) is called an annotated sequence. Note that a structure where no pair of interactions is inconsistent with each other forms a valid secondary structure of an RNA sequence, possibly with pseudoknots.
We are given two annotated sequences (sA, pA) and (sB, pB) and model the input as a structural graph G
S
= (V, L). The set V denotes the vertices of the graph, in this case the bases of the sequences, and we write and for the i th base in sequence A and B, respectively. The set L contains undirected alignment edges between vertices of sequences A and B, for sake of better distinction called lines. A line l ∈ L with l = () represents the alignment of the k-th character in sequence A with the l-th character in sequence B. By s(l) and t(l) we refer to the adjacent vertices of line l in sequence A and B, respectively. A subset ⊂ L represents a valid sequence alignment of sequence A and B, if there are no two lines k, l ∈ such that k and l cross or touch each other [40]. Crossing or touching lines induce ordering conflicts in the alignment (see Fig. 5 for an illustration). We denote with the set the collection of all maximal sets of mutually conflicting lines.
We extend the original graph G
S
= (V, L) by the edge set I to model the annotation of the input sequences in our graph. Consequently, we have interaction edges between vertices of the same sequence, i.e., an edge () representing the interaction between nucleotides i and j in sequence A. Figure 6 illustrates these definitions by means of an example. Note that at this stage gaps are not modelled in our formulation. Hence, we have to extend our model to incorporate gap penalties in our model.
Gap edges
The initial model containing only lines (the set L) and interaction edges (the set I) is augmented by a set of gap edges G, which represents gaps in the alignment. For sake of compactness, we just describe the gap edges of sequence A, the gap edges of sequence B are defined analogously: We have an edge from to with k, l ∈ 1, ..., |sA| representing the fact that no character of the substring is aligned to any character of the sequence B, whereas (if k - 1 > 0) and (if l + 1 ≤ |sB|) are aligned with some characters in sequence B. We say that are spanned by the gap edge . Figure 7 shows the graph extended by gap edges.
Two gap edges and ∈ G are in conflict with each other if {k, ..., l + 1} ∩ {m, ..., n} ≠ ∅, that is, if they overlap or touch. This is intuitively clear, because we do not want to split a longer gap into two separate gaps: Consequently, there has to be at least one aligned character between any two realized gap edges. See Fig. 8 for an example. We denote with the set the collection of all maximal sets of mutually conflicting gap edges. Finally, we define as the set of gap edges that span the nodes .
Interaction match
We call two interactions and an interaction match if there exist two alignment edges and that do not cross each other. We say that a subset ⊆ L realizes the interaction match if {a, b} ⊆ . Interaction matches realized by a set represent common interactions that are preserved by aligning the begin and end nucleotides of the interaction. Figure 9 illustrates the definitions.
Gapped structural trace
A triple (, , ) with ⊆ L, ⊆ I, and ⊆ G is called a valid gapped structural trace if and only if the following constraints are satisfied:
-
1.
The vertices and of sequences A and B are either incident to exactly one alignment edge e ∈ or spanned by a gap edge g ∈ . In other words, a nucleotide is either aligned or "aligned" to a gap.
-
2.
A line l can realize at most one interaction match (l, m), because a nucleotide can pair with at most one other nucleotide in a valid RNA secondary structure.
-
3.
There are no two lines k, l ∈ L that cross or touch each other: Crossing lines induce ordering conflicts in the alignment, whereas touching lines imply that two different nucleotides are mapped to the same nucleotide in the other sequence.
-
4.
There are no two gaps edges such that is in conflict with , and there are no two gaps edges such that is in conflict with .
Figure 10 visualizes these properties by showing a toy example for a gapped structural trace.
We assign weights w
l
and w
kl
for each line l and interaction match (k, l) that represents the benefit of realizing l or (k, l). By default, we set these scores along the lines of standard scoring methods, e.g., BLOSUM matrices for the weight of the lines, base pair probabilities [18] for the interaction match scores, or by using the RIBOSUM scoring matrices derived from alignments of ribosomal RNAs [47]. Our model, however, is not limited to standard scoring schemes. Since we can set each (sequence or structure) weight separately, the user can assign completely arbitrary scores to each line or interaction match which makes the incorporation of expert knowledge into the computation of structural alignments easy. Furthermore, we assign negative weights to gap edges with representing the gap penalty for aligning substring with gap characters. Note that the model allows for arbitrary, position-dependent gap scoring.
Approaches for traditional sequence alignment aim at maximizing the score of edges in an alignment . Structural alignments, however, must also take the structural information encoded in the interaction edges into account. The problem of structurally aligning two annotated sequences (sA, pA) and (sB, pB) corresponds to finding an alignment such that the weight of the sequence part (i.e., the weight of selected lines plus gap penalties) plus the weight of the realized interaction matches is maximal. More formally, we seek to maximize , where (, ) represents an alignment with arbitrary gap costs, and contains the interaction matches realized by . Observe that this graph-theoretical reformulation matches the problem statement given at the beginning of this section.
Biological aspects
The basic entities of our model are the alignment, interaction, and gap edges in the structural graph, which contribute to the objective function rather independently. Hence, one could argue that the model does not capture important features of RNA structures, like the incorporation of stacking energies or loop scores that depend on the actual size of the loop. We are aware of these limitations.
Nevertheless, the results of our computational experiments presented in Sect. 3 show that this approach yields high-quality structural alignments. In the pairwise case, our graph-based model is competitive with state-of-the-art approaches and develops its strength with an increasing number of sequences, outperforming all other programs that we tested (for details see Sect. 3). Additionally, the authors of [48] showed that models that do not capture stacking energies and loops are still competitive.
Beyond, our graph-based approach offers the possibility to change the model from nucleotides as the working entities to stems: Instead of taking single nucleotides as the vertices of the structural graph, we could search for candidate stems in the sequences and introduce a vertex for each half-stem. This would allow us to incorporate energy-based scoring into our model, which then, however, will have to be adapted to take into account overlapping stem candidates.
2.2 Integer linear program and Lagrangian relaxation
Given the graph-theoretical model it is straightforward to transform it to an integer linear program (ILP). We associate binary variables with each line, interaction match, and gap edge, and model the constraints of a valid gapped structural trace by adding inequalities to the linear program.
The handling of lines and gap edges is straightforward: We associate a x and z variable to each line and gap edge, respectively. We set x
l
= 1 if and only if line l ∈ L is part of the alignment , and z
a
= 1 if and only if gap edge a ∈ G is part of the alignment.
Interaction matches, however, are treated slightly differently: Instead of assigning an ILP variable to each interaction edge, we split an interaction match (l, m) into two separate directed interaction matches (l, m) and (m, l) that are detached from each other. A directed interaction match (l, m) is realized by the line set if l ∈ . We then have y
lm
= 1 if and only if the directed interaction match (l, m) is realized (note again that y
lm
and y
ml
are distinct variables). Figure 11 gives an illustration of the variable splitting. Note that this does not change the underlying model, it just makes the ILP formulation more convenient for further processing. Splitting interaction matches has first been proposed by Caprara and Lancia in the context of contact map overlap [42].
As described in Sect. 2.1, the sets L, I, and G refer to lines, interaction edges, and gap edges, and the sets and contain subsets of mutually conflicting lines or gap edges.
We then give the following ILP formulation for the gapped structural trace problem:
(1)
(2)
(3)
(4)
(5)
(6)
y
lm
= y
ml
∀l, m ∈ L (7)
x ∈ {0, 1}Ly ∈ {0, 1}L×Lz ∈ {0, 1}G (8)
Lemma 2.1 (Proof in [16])
A feasible solution to the ILP (1)–(8) corresponds to a valid gapped structural trace of weight equal to the objective function and vice versa.
Observe that constraints (2)–(6) exactly correspond to the properties of a gapped structural trace as described in Sect. 2.1.
In [49] the authors show that the problem of computing an optimal gapped structural trace is already NP-hard, even without considering gap costs. Hence, we cannot hope to find an optimal solution to the problem in polynomial time.
Commonly used mathematical programming techniques for NP-hard problems therefore resort to various relaxation techniques that are the basis for further processing. A relaxation results from the removal of constraints from the original ILP formulation, and is often solvable in polynomial time. A popular relaxation is the so called LP relaxation where the integrality constraints on the variables are dropped, yielding a standard linear program, for which solutions can be found efficiently.
Another possible relaxation technique is Lagrangian relaxation: Instead of just dropping certain inequalities, we move them to the objective function, associated with a penalty term that becomes active if the dropped constraint is violated. By iteratively adapting those penalty terms using, for instance, subgradient optimization, we get better solutions with each iteration. A crucial parameter is therefore the number of iterations that we perform: the higher the number, the more likely it is to end up with an optimal or near-optimal solution.
Inspired by the successful approach of Lancia and Caprara for the contact map overlap problem, we consider the relaxation resulting from moving constraint (7) into the objective function.
Lemma 2.2 (Proof in [16])
The relaxed problem is equivalent to the pairwise sequence alignment problem with arbitrary gap costs.
2.3 Algorithms for the pairwise and multiple case
Our algorithm for the pairwise RNA structural alignment problem consists of iteratively solving the primary sequence alignment problem associated with the relaxation. The penalization of the relaxed inequality is reflected in an adapted scoring matrix for the primary alignment. Intuitively, these weights incorporate also the structural information. In each iteration we get a new lower bound for the problem by analyzing the primary sequence alignments and inferring the best structural completion of this alignment. In fact, this corresponds to solving a maximum weighted matching problem in a general graph. For details see [16]. In the course of the algorithm, these solutions get better and better. Furthermore, the value of the relaxation itself constitutes an upper bound on the problem, which decreases with an increasing number of iterations. When these bounds coincide, we have provably found an optimal solution, otherwise, we get near-optimal solutions with a quality guarantee. Assuming an upper bound on the number of interaction matches per line, which is typically the case with base pair probability matrices of RNA sequences, we get a running time of O(n2) for each Lagrangian iteration. Since we fix the number of iterations, this leads to an overall time complexity of O(n2).
For the multiple case, similar in spirit to the MARNA software, we combine our pairwise method with the popular progressive alignment software T-COFFEE [50]. Progressive methods build multiple alignments from pairwise alignments. The pairwise distances are usually used to compute a guide tree which in turn determines the order in which the sequences are aligned to the evolving multiple alignment.
Progressive approaches often suffer from their sensitivity to the order in which the sequences are chosen during the alignment process. T-COFFEE reduces this effect by making use of local alignment information from all pairwise sequence alignments during its progressive alignment phase. We supply such local alignment information based on all-against-all structural alignments computed with our pairwise approach, assigning a high score to conserved interaction matches. The structural information is subsequently passed on to T-COFFEE that computes a multiple alignment, taking into account the additional structural information.