PicXAA-R extends the idea of PicXAA, the multiple sequence alignment algorithm that maximizes the expected number of correctly aligned bases, to the structural alignment of RNA sequences. PicXAA-R uses a greedy approach that builds up the alignment from sequence regions with high local similarities and high base-pairing probabilities. Thus, it avoids the propagation of early stage alignment errors, usually observed in progressive techniques. The algorithm employs a probabilistic framework by utilizing both the inter-sequence base alignment probabilities and the intra-sequence base-pairing probabilities. The following subsections provide an overview of the proposed algorithm.
Preliminary
To align m RNA sequences in a set S = {s1, ⋯ , s
m
}, we need to compute the following probabilities.
-
P
a
(x
i
~ y
j
| x, y): For each pair sequence x, y ∈ S, P
a
( x
i
~ y
j
| x, y) is the probability that bases x
i
∈ x and y
j
∈ y are matched in the true (unknown) alignment. We can compute the posterior pairwise alignment probabilities using the pair hidden Markov model (PHMM) [40].
-
P
b
(x
i
~ x
j
| x): For each sequence x ∈ S, P
b
( x
i
~ x
j
| x) is the probability that two bases x
i
, x
j
∈ x form a base-pair. We can exploit different approaches, such as the McCaskill algorithm [41] or the CONTRAfold model [39], to compute the base-pairing probabilities.
We use these probabilities in the following probabilistic structural alignment scheme.
Consistency transformation
Here, we use three types of probabilistic consistency transformations to modify the pairwise base alignment probabilities and base-pairing probabilities using the information from other sequences in the alignment. This modification makes these posterior probabilities suitable for constructing a consistent and accurate structural alignment.
Inter-sequence probabilistic consistency transformation for base alignment probabilities
In the first consistency transformation, we incorporate the information from other sequences in the alignment to improve the estimation of pairwise base alignment probabilities. The motivation of this transformation is that all the pairwise alignments induced from a given MSA should be consistent with each other. This means that if position x
i
(∈ x) aligns with position z
k
(∈ z) in the x – z alignment, and if z
k
aligns with position y
j
(∈ y) in the z – y alignment, then x
i
must align with y
j
in the x – y alignment. We can thus utilize the “intermediate” sequence z to improve the x – y alignment by making it consistent with the alignments x – z and z – y.
Based on this motivation, we introduced an enhanced probabilistic consistency transformation in PicXAA [34], which improves the original transformation proposed by Do et al.[30]. The enhanced transformation modifies the alignment probability for a base-pair x
i
~ y
j
, by incorporating the alignment probability between x
i
and z
k
and that between z
k
and y
j
. This transformation can be written as:
where P(x◊z) represents the probability that x and z are homologous, defined as:
where ā is the optimal pairwise alignment of x and z.
This transformation improves the consistency of the x – y alignment with other pairwise alignments in the MSA, by incorporating information only from homologous sequences. In this way, we can obtain more probabilistically consistent estimate of the posterior alignment probabilities, which helps enhance the quality of the final MSA.
Intra-sequence probabilistic consistency transformation for base-pairing probabilities
In the second transformation, we incorporate the pairwise alignment information to the structural formation of the sequences. This transformation exploits this observation that the base-pairings in each sequence should be consistent with the pairwise base alignments induced from a given structural alignment. This means that if positions y
j
~ y
j
′ form a base-pair in y, where x
i
(∈ x) aligns with y
j
(∈ y) and x
i
′, (∈ x) aligns with y
j
′ (∈ y), then x
i
~ x
i
′ must form a base-pair in x. Thus, we can utilize the base alignment information to improve the estimation of the x
i
~ x
i
′ base-pairing probability.
Based on this observation, Kiryu et al.[12] introduced a transformation for base-pairing probabilities, which was modified later in [42] as:
where α ∈ [0, 1] is a weight parameter between the target sequence x and rest of sequences. This transformation assumes that all sequences y ∈ S – {x} are homologous to the given sequence x. However, when we have a set of distantly related sequences in S, this assumption does not necessarily hold. To address this problem, here, we modify this transformation by improving the base-pairing probability using the information just from the closely related sequences to the given sequence x. Therefore, like the inter-sequence consistency transformation, we explicitly consider the relative significance of each sequence y ∈ S – {x} in improving the base-pairing probabilities in x.
Let Z = {y ∈ S – {x}| x ◊ y} be the set of sequences in S – {x} that are homologous to x. The notation x ◊ y means x and y are homologous and functionally related to each other. Using only the relevant sequences, which are included in the set Z, we define this transformation as:
The second term in the right hand side of the above equation can be also written as:
using the identity function I{·}, where I{x ◊ y} = 1 if y is homologous to x, and I{x ◊ y} = 0 otherwise. In practice, we cannot judge with certainty whether two sequences are homologous or not. Thus, we describe this relationship probabilistically, using the expectation as: E [I{x ◊ y}] = P(x ◊ y), where P(x ◊ y) is the homology probability and can be estimated as described in the previous subsection. By replacing the identity functions with their expected values in the previous equation, we propose the following enhanced intra-sequence probabilistic consistency transformation as:
Probabilistic four-way consistency transformation for base alignment probabilities
In the third consistency transformation, we incorporate the structural information to the pairwise alignments. This transformation is based on the same observation that motivated the intra-sequence consistency transformation; that is, the pairwise base alignments induced from a given structural alignment should be consistent with the base-pairings in the corresponding pair sequence. However, this time, we utilize the base-pairing information to improve the x – y alignment.
Based on this motivation, Katoh and Toh introduced the four-way consistency transformation in [26] which was also latter implemented in [17]. We use this idea in a probabilistic fashion by incorporating the base alignment and the base-pairing probabilities as in [17]. This transformation is defined as:
where β ∈ [0, 1] is a weight parameter.
Using the sparsity of alignment and pairing probability matrices, we can efficiently implement these three transformations successively. The inter-sequence consistency transformation has a complexity of O(µ2Lm3), the intra-sequence transformation has a complexity of O(µ3Lm2), and the four-way consistency transformation has a computational complexity of O(µ4Lm2), where µ is the average number of non-zero elements per row (typically 1 ≤ µ ≤ 5 in real examples), m is the number of sequences, and L is the length of each sequence.
Constructing the structural alignment
To find a valid structural alignment of a set of RNA sequences, we propose a two-step greedy approach that builds up the alignment starting from those regions with higher base-pairing and base alignment probabilities. The proposed greedy scheme extends the idea of PicXAA [34] to multiple RNA alignments. In PicXAA, we construct the multiple protein sequence alignment by successively inserting the most probable pairwise residue alignment into the final alignment. In the proposed algorithm, we add another step before the greedy graph construction step of PicXAA to better incorporate the secondary structure information in RNAs. This two-step alignment construction approach, along with intra-sequence consistency transformation and four-way consistency transformation, described in the previous subsection, helps PicXAA-R to effectively integrate both sequence and structural similarities to construct the final alignment. The proposed structural alignment approach is described in the following.
The greedy alignment approach we proposed in PicXAA [34] is conceptually similar to the one used in sequence annealing algorithms [29, 35, 36]. However, it should be noted that unlike sequence annealing, which greedily merges pairs of columns, we always add a single pairwise base alignment at a time, based on the consistency-transformed posterior alignment probabilities.
We represent the structural alignment as a directed acyclic graph G = (V, E) where, V is the set of vertices and E is the set of directed edges. Each vertex c(i) ∈ V corresponds to a column in the final alignment, and each directed edge e = (c(i), c(j)) ∈ E implies that column c(i) precedes column c(j) in the given alignment. Each column c(i) ∈ V consists of positions from different sequences that will appear in the same column in the final alignment.
When inserting a new pairwise base alignment, we should consider the following requirements to obtain a legitimate multiple RNA alignment:
-
(Avoid Cycles) The alignment graph G should remain acyclic.
-
(Left-Right Compatibility) In the first greedy step where we use structural information, we should consider left-right compatibility. That is, for any paired columns (c, c′), if column c appears in the left part of the stem in the final structure, then for each base x
i
∈ c that pairs with some xi′ ∈ c′ of the same sequence x, we should have i <i′.
Thus, while we build up the alignment graph, we satisfy the structural constraints and alignment constraints by verifying whether the new inserted pairwise base alignment keeps the graph acyclic and left-right compatible.
The two-step alignment construction approach is as follows:
Step 1-Structural skeleton construction
In the first alignment construction step, we greedily insert the most probable alignments of base-pairs with high base-pairing probability. To this aim, we define the ordered set B as
Here, B is the ordered set of base-pairs whose transformed base-pairing probability is larger than a threshold T
b
. The base-pairs in B are sorted in descending order according to their transformed base-pairing probability,
. We successively pick the most confident base-pair (x
i
, x
i
′) from B. For a selected base-pair, we look for the best match among the members of B. That is, we seek for a pair (y
j
, y
j
′) ∈ B which belongs to another sequence y and satisfies the two compatibility conditions above in G while maximizing the following probability:
For this pair (y
j
, yj′), we insert two pairwise alignments (x
i
~ y
j
) and (xi′ ~ yj′) into the alignment graph G . Figure 1A illustrates this process.
Upon inserting a new pair p* = (x
i
, y
j
) to G, three scenarios may occur: (1) New column addition; (2) Extension of an existing column; or (3) Merging of two columns. The detailed description of the procedures needed for each case can be found in [34]. Later in this section, we provide a summary of those procedures. By successively inserting the most probable alignment for confident base-pairs, we construct the skeleton of the alignment enriched by structural information. Next, we complete this skeleton by greedily inserting highly probable base alignments.
Step 2-Inserting highly probable local alignments
In this step, we update the skeleton alignment obtained in the previous step by successively inserting the most probable pairwise base alignments into the multiple structural alignment, as in PicXAA [34]. Thus, we sort all remaining pairwise alignments (x
i
, y
j
) according to their transformed alignment probability
in an ordered set A. We greedily build up G by repeatedly picking the most probable pair in A, which is not processed yet, provided that it is compatible with the current alignment. Again, insertion of any pair p* = (x
i
, y
j
) to G will result in one of the scenarios of new column addition, extension of an existing column, or merging of two columns.
Here, we briefly discuss these three cases (For detailed description see [34]):
-
1.
New column addition: We insert a new compatible vertex c* = {x
i
, y
j
} in G if neither x
i
nor y
j
belongs to some existing column in G . Figure 1B illustrates this process.
-
2.
Extending an existing column: If only one of the bases in p*, let say x
i
, belongs to some vertex c ∈ V, we should add the other base y
j
to the same vertex c. Figure 1C illustrates this process.
-
3.
Merging two vertices: When x
i
∈ c 1 and y
j
∈ c 2 belong to two different vertices c 1, c 2 ∈ V, we merge the vertices c 1 and c 2. Figure 1D illustrates this process.
After updating the graph as described above, we prune G to avoid redundant edges, thereby improving the computational efficiency of the construction process.
Upon finishing the two-step graph construction, we use the obtained alignment graph G to find the multiple alignment. We use the depth-first search algorithm to order the vertices in V in an ordered set A = (v1, v2, ⋯ , v
n
) such that there is no path from v
i
to v
j
in G for any i >j. In the resulted ordered set A, each member corresponds to a column in the alignment, and putting them together gives the alignment. Further details of the graph construction and alignment process can be found in [34]. An illustrative example for the graph construction process using PicXAA-R can be found in Figure 2.
Discriminative refinement
As the final step, we apply a refinement step to improve the alignment quality in sequence regions with low alignment probability. We employ the iterative refinement strategy based on the discriminative-split-and-realignment technique that was introduced in PicXAA [34]. We repeat the following steps successively for each sequence x ∈ S:
-
1.
Find S
x
⊂ S, the set of similar sequences to x using the k-means clustering.
-
2.
Align x with the profile of sequences in S
x
.
-
3.
Perform the profile-profile alignment of
and S – S
x
.
This refinement strategy takes advantage of both the intra-family similarity as well as the inter-family similarity, thereby improving the alignment quality in low similarity regions without breaking the confidently aligned bases.