PicXAAR extends the idea of PicXAA, the multiple sequence alignment algorithm that maximizes the expected number of correctly aligned bases, to the structural alignment of RNA sequences. PicXAAR uses a greedy approach that builds up the alignment from sequence regions with high local similarities and high basepairing probabilities. Thus, it avoids the propagation of early stage alignment errors, usually observed in progressive techniques. The algorithm employs a probabilistic framework by utilizing both the intersequence base alignment probabilities and the intrasequence basepairing probabilities. The following subsections provide an overview of the proposed algorithm.
Preliminary
To align m RNA sequences in a set S = {s_{1}, ⋯ , s_{
m
} }, we need to compute the following probabilities.

P_{
a
}(x_{
i
} ~ y_{
j
} x, y): For each pair sequence x, y ∈ S, P_{
a
}( x_{
i
} ~ y_{
j
} x, y) is the probability that bases x_{
i
} ∈ x and y_{
j
} ∈ y are matched in the true (unknown) alignment. We can compute the posterior pairwise alignment probabilities using the pair hidden Markov model (PHMM) [40].

P_{
b
}(x_{
i
} ~ x_{
j
} x): For each sequence x ∈ S, P_{
b
}( x_{
i
} ~ x_{
j
} x) is the probability that two bases x_{
i
}, x_{
j
} ∈ x form a basepair. We can exploit different approaches, such as the McCaskill algorithm [41] or the CONTRAfold model [39], to compute the basepairing probabilities.
We use these probabilities in the following probabilistic structural alignment scheme.
Consistency transformation
Here, we use three types of probabilistic consistency transformations to modify the pairwise base alignment probabilities and basepairing probabilities using the information from other sequences in the alignment. This modification makes these posterior probabilities suitable for constructing a consistent and accurate structural alignment.
Intersequence probabilistic consistency transformation for base alignment probabilities
In the first consistency transformation, we incorporate the information from other sequences in the alignment to improve the estimation of pairwise base alignment probabilities. The motivation of this transformation is that all the pairwise alignments induced from a given MSA should be consistent with each other. This means that if position x_{
i
} (∈ x) aligns with position z_{
k
} (∈ z) in the x – z alignment, and if z_{
k
} aligns with position y_{
j
} (∈ y) in the z – y alignment, then x_{
i
} must align with y_{
j
} in the x – y alignment. We can thus utilize the “intermediate” sequence z to improve the x – y alignment by making it consistent with the alignments x – z and z – y.
Based on this motivation, we introduced an enhanced probabilistic consistency transformation in PicXAA [34], which improves the original transformation proposed by Do et al.[30]. The enhanced transformation modifies the alignment probability for a basepair x_{
i
} ~ y_{
j
} , by incorporating the alignment probability between x_{
i
} and z_{
k
} and that between z_{
k
} and y_{
j
}. This transformation can be written as:
where P(x◊z) represents the probability that x and z are homologous, defined as:
where ā is the optimal pairwise alignment of x and z.
This transformation improves the consistency of the x – y alignment with other pairwise alignments in the MSA, by incorporating information only from homologous sequences. In this way, we can obtain more probabilistically consistent estimate of the posterior alignment probabilities, which helps enhance the quality of the final MSA.
Intrasequence probabilistic consistency transformation for basepairing probabilities
In the second transformation, we incorporate the pairwise alignment information to the structural formation of the sequences. This transformation exploits this observation that the basepairings in each sequence should be consistent with the pairwise base alignments induced from a given structural alignment. This means that if positions y_{
j
} ~ y_{
j
}_{′} form a basepair in y, where x_{
i
} (∈ x) aligns with y_{
j
} (∈ y) and x_{
i
}_{′}, (∈ x) aligns with y_{
j
}_{′} (∈ y), then x_{
i
} ~ x_{
i
}_{′} must form a basepair in x. Thus, we can utilize the base alignment information to improve the estimation of the x_{
i
} ~ x_{
i
}_{′} basepairing probability.
Based on this observation, Kiryu et al.[12] introduced a transformation for basepairing probabilities, which was modified later in [42] as:
where α ∈ [0, 1] is a weight parameter between the target sequence x and rest of sequences. This transformation assumes that all sequences y ∈ S – {x} are homologous to the given sequence x. However, when we have a set of distantly related sequences in S, this assumption does not necessarily hold. To address this problem, here, we modify this transformation by improving the basepairing probability using the information just from the closely related sequences to the given sequence x. Therefore, like the intersequence consistency transformation, we explicitly consider the relative significance of each sequence y ∈ S – {x} in improving the basepairing probabilities in x.
Let Z = {y ∈ S – {x} x ◊ y} be the set of sequences in S – {x} that are homologous to x. The notation x ◊ y means x and y are homologous and functionally related to each other. Using only the relevant sequences, which are included in the set Z, we define this transformation as:
The second term in the right hand side of the above equation can be also written as:
using the identity function I{·}, where I{x ◊ y} = 1 if y is homologous to x, and I{x ◊ y} = 0 otherwise. In practice, we cannot judge with certainty whether two sequences are homologous or not. Thus, we describe this relationship probabilistically, using the expectation as: E [I{x ◊ y}] = P(x ◊ y), where P(x ◊ y) is the homology probability and can be estimated as described in the previous subsection. By replacing the identity functions with their expected values in the previous equation, we propose the following enhanced intrasequence probabilistic consistency transformation as:
Probabilistic fourway consistency transformation for base alignment probabilities
In the third consistency transformation, we incorporate the structural information to the pairwise alignments. This transformation is based on the same observation that motivated the intrasequence consistency transformation; that is, the pairwise base alignments induced from a given structural alignment should be consistent with the basepairings in the corresponding pair sequence. However, this time, we utilize the basepairing information to improve the x – y alignment.
Based on this motivation, Katoh and Toh introduced the fourway consistency transformation in [26] which was also latter implemented in [17]. We use this idea in a probabilistic fashion by incorporating the base alignment and the basepairing probabilities as in [17]. This transformation is defined as:
where β ∈ [0, 1] is a weight parameter.
Using the sparsity of alignment and pairing probability matrices, we can efficiently implement these three transformations successively. The intersequence consistency transformation has a complexity of O(µ^{2}Lm^{3}), the intrasequence transformation has a complexity of O(µ^{3}Lm^{2}), and the fourway consistency transformation has a computational complexity of O(µ^{4}Lm^{2}), where µ is the average number of nonzero elements per row (typically 1 ≤ µ ≤ 5 in real examples), m is the number of sequences, and L is the length of each sequence.
Constructing the structural alignment
To find a valid structural alignment of a set of RNA sequences, we propose a twostep greedy approach that builds up the alignment starting from those regions with higher basepairing and base alignment probabilities. The proposed greedy scheme extends the idea of PicXAA [34] to multiple RNA alignments. In PicXAA, we construct the multiple protein sequence alignment by successively inserting the most probable pairwise residue alignment into the final alignment. In the proposed algorithm, we add another step before the greedy graph construction step of PicXAA to better incorporate the secondary structure information in RNAs. This twostep alignment construction approach, along with intrasequence consistency transformation and fourway consistency transformation, described in the previous subsection, helps PicXAAR to effectively integrate both sequence and structural similarities to construct the final alignment. The proposed structural alignment approach is described in the following.
The greedy alignment approach we proposed in PicXAA [34] is conceptually similar to the one used in sequence annealing algorithms [29, 35, 36]. However, it should be noted that unlike sequence annealing, which greedily merges pairs of columns, we always add a single pairwise base alignment at a time, based on the consistencytransformed posterior alignment probabilities.
We represent the structural alignment as a directed acyclic graph G = (V, E) where, V is the set of vertices and E is the set of directed edges. Each vertex c^{(}^{i}^{)} ∈ V corresponds to a column in the final alignment, and each directed edge e = (c^{(}^{i}^{)}, c^{(}^{j}^{)}) ∈ E implies that column c^{(}^{i}^{)} precedes column c^{(}^{j}^{)} in the given alignment. Each column c^{(}^{i}^{)} ∈ V consists of positions from different sequences that will appear in the same column in the final alignment.
When inserting a new pairwise base alignment, we should consider the following requirements to obtain a legitimate multiple RNA alignment:

(Avoid Cycles) The alignment graph G should remain acyclic.

(LeftRight Compatibility) In the first greedy step where we use structural information, we should consider leftright compatibility. That is, for any paired columns (c, c′), if column c appears in the left part of the stem in the final structure, then for each base x_{
i
} ∈ c that pairs with some x_{i′} ∈ c′ of the same sequence x, we should have i <i′.
Thus, while we build up the alignment graph, we satisfy the structural constraints and alignment constraints by verifying whether the new inserted pairwise base alignment keeps the graph acyclic and leftright compatible.
The twostep alignment construction approach is as follows:
Step 1Structural skeleton construction
In the first alignment construction step, we greedily insert the most probable alignments of basepairs with high basepairing probability. To this aim, we define the ordered set B as
Here, B is the ordered set of basepairs whose transformed basepairing probability is larger than a threshold T_{
b
} . The basepairs in B are sorted in descending order according to their transformed basepairing probability, . We successively pick the most confident basepair (x_{
i
} , x_{
i
}_{′}) from B. For a selected basepair, we look for the best match among the members of B. That is, we seek for a pair (y_{
j
} , y_{
j
}_{′}) ∈ B which belongs to another sequence y and satisfies the two compatibility conditions above in G while maximizing the following probability:
For this pair (y_{
j
} , y_{j′}), we insert two pairwise alignments (x_{
i
} ~ y_{
j
} ) and (x_{i′} ~ y_{j′}) into the alignment graph G . Figure 1A illustrates this process.
Upon inserting a new pair p* = (x_{
i
} , y_{
j
} ) to G, three scenarios may occur: (1) New column addition; (2) Extension of an existing column; or (3) Merging of two columns. The detailed description of the procedures needed for each case can be found in [34]. Later in this section, we provide a summary of those procedures. By successively inserting the most probable alignment for confident basepairs, we construct the skeleton of the alignment enriched by structural information. Next, we complete this skeleton by greedily inserting highly probable base alignments.
Step 2Inserting highly probable local alignments
In this step, we update the skeleton alignment obtained in the previous step by successively inserting the most probable pairwise base alignments into the multiple structural alignment, as in PicXAA [34]. Thus, we sort all remaining pairwise alignments (x_{
i
} , y_{
j
} ) according to their transformed alignment probability in an ordered set A. We greedily build up G by repeatedly picking the most probable pair in A, which is not processed yet, provided that it is compatible with the current alignment. Again, insertion of any pair p* = (x_{
i
} , y_{
j
} ) to G will result in one of the scenarios of new column addition, extension of an existing column, or merging of two columns.
Here, we briefly discuss these three cases (For detailed description see [34]):

1.
New column addition: We insert a new compatible vertex c* = {x_{
i
} , y_{
j
} } in G if neither x_{
i
} nor y_{
j
} belongs to some existing column in G . Figure 1B illustrates this process.

2.
Extending an existing column: If only one of the bases in p*, let say x_{
i
} , belongs to some vertex c ∈ V, we should add the other base y_{
j
} to the same vertex c. Figure 1C illustrates this process.

3.
Merging two vertices: When x_{
i
} ∈ c _{1} and y_{
j
} ∈ c _{2} belong to two different vertices c _{1}, c _{2} ∈ V, we merge the vertices c _{1} and c _{2}. Figure 1D illustrates this process.
After updating the graph as described above, we prune G to avoid redundant edges, thereby improving the computational efficiency of the construction process.
Upon finishing the twostep graph construction, we use the obtained alignment graph G to find the multiple alignment. We use the depthfirst search algorithm to order the vertices in V in an ordered set A = (v_{1}, v_{2}, ⋯ , v_{
n
} ) such that there is no path from v_{
i
} to v_{
j
} in G for any i >j. In the resulted ordered set A, each member corresponds to a column in the alignment, and putting them together gives the alignment. Further details of the graph construction and alignment process can be found in [34]. An illustrative example for the graph construction process using PicXAAR can be found in Figure 2.
Discriminative refinement
As the final step, we apply a refinement step to improve the alignment quality in sequence regions with low alignment probability. We employ the iterative refinement strategy based on the discriminativesplitandrealignment technique that was introduced in PicXAA [34]. We repeat the following steps successively for each sequence x ∈ S:

1.
Find S_{
x
} ⊂ S, the set of similar sequences to x using the kmeans clustering.

2.
Align x with the profile of sequences in S_{
x
}.

3.
Perform the profileprofile alignment of and S – S_{
x
}.
This refinement strategy takes advantage of both the intrafamily similarity as well as the interfamily similarity, thereby improving the alignment quality in low similarity regions without breaking the confidently aligned bases.