Consensus folding energy and covariance score in RNAalifold
The basic approach of RNAalifold [25] is to integrate covarying mutations into the thermodynamic model to predict consensus structures. First, covariance scores are computed for all pairs of columns to determine possible pairing positions in the consensus structure. Then, based on the MFE model, the minimum average folding energy is computed by dynamic programming. Assume the given alignment contains N sequences, each with L symbols (nucleotides and gaps); i (1≤i≤L) indexes the symbols of a sequence and k (1≤k≤N) indexes the sequences. The free energy is minimized using the following recursive functions:
(1)
where $F_{i,j}$, $C_{i,j}$, $FM_{i,j}$, and $FM1_{i,j}$ denote the minimum free energies for the region between the ith column and the jth column with an unconstrained structure, with an enclosed structure, with a multi-loop, and with a multi-loop containing a single branch, respectively. $H(i,j,s)$ is the free energy for a hairpin loop enclosed by $s_i$ and $s_j$, and $I(i,j,p,q,s)$ is the free energy for an internal loop containing two base-pairs, one between $s_i$ and $s_j$ and the other between $s_p$ and $s_q$. $M_a$ and $M_c$ are penalties for closing bases and non-pairing bases in multi-loops. $M_b$ is the bonus for branch bases in multi-loops.
The recursive functions were derived from Turner's model [35]. One major change made by RNAalifold for consensus folding is the use of the covariance score γ. It is not only a term in the computation of free energy, but also determines the possible pairing columns in the alignment. The score has two parts: a bonus and a penalty. The first part is called the conservation score. For each pair of sequences at columns i and j, three levels of confidence for pairing are assessed: base-pairs without mutation, base-pairs with one mutation, and base-pairs with two mutations. In the latest version of the Vienna RNA package (v2.0) [36], the recursive function for computing the conservation score is:
(2)
where $h(x,y)$ is the Hamming distance between base x and base y, and the set of all possible base-pairs is {‘AU’, ‘UA’, ‘CG’, ‘GC’, ‘GU’, ‘UG’}. The second part is the penalty score $Q_{i,j}$, which accounts for pairs of symbols that cannot form a base-pair:
(3)
Overall, the covariance score is:
(4)
where $\phi_1 = \phi_2 = 1$. A threshold value $\gamma_t = -2$ is defined for $\gamma_{i,j}$. If $\gamma_{i,j} > \gamma_t$, the ith column and the jth column are considered to be pairing columns. In the final output, the minimum average folding energy, including the covariance score, is normalized by dividing by N.
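The pairwise counting idea behind the conservation score can be illustrated with a minimal sketch. It assumes a plain, unnormalized sum of Hamming distances over all sequence pairs whose symbols at both columns form base-pairs; the exact scaling used in Equation (2) is not reproduced, and the names `pairwise_bonus` and `is_pairing` are hypothetical helpers, not part of the Vienna RNA package.

```python
from __future__ import annotations
from itertools import combinations

# The six admissible base-pairs listed in the text.
BASE_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def hamming(x: str, y: str) -> int:
    """Hamming distance between two single bases: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


def pairwise_bonus(alignment: list[str], i: int, j: int) -> int:
    """Sum h(.,.) at columns i and j (0-based) over all pairs of sequences.

    Assumption: a sequence pair contributes only if both sequences form a
    base-pair at (i, j); no normalization is applied to the sum.
    """
    total = 0
    for a, b in combinations(alignment, 2):
        if a[i] + a[j] in BASE_PAIRS and b[i] + b[j] in BASE_PAIRS:
            total += hamming(a[i], b[i]) + hamming(a[j], b[j])
    return total


def is_pairing(gamma_ij: float, gamma_t: float = -2.0) -> bool:
    """Threshold test from the text: columns pair if gamma_ij > gamma_t."""
    return gamma_ij > gamma_t


if __name__ == "__main__":
    toy = ["GC", "GC", "AU", "GU"]  # hypothetical two-column alignment
    print(pairwise_bonus(toy, 0, 1))
```

Because the sum runs over every pair of sequences, the result does not change if the sequences are reordered or regrouped; in this sketch no tree information enters the score.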
Phylogenetic-based covarying mutation
RNAalifold incorporates covarying mutations into consensus folding to improve the detection of pairing columns. From Equation (2), it can be seen that RNAalifold measures the level of covariance by treating all sequences equally and trying all possible combinations of base-pairs. In short, RNAalifold models the relationship among sequences as a complete graph. As a result, the specific evolutionary relationship among sequences in the phylogenetic history is ignored. Take the RNA structural alignment in Figure 1 as an example. The red and green columns achieve the same covariance score (2) in RNAalifold. However, as described in [37], the conservation evidence in Figure 1(c) is stronger than that in Figure 1(b), because at least two mutations must occur at the green columns while only one is required to form the red ones.
PhyloRNAalifold models the relationship among aligned sequences as a tree by introducing the phylogenetic history of the alignment into the computation of covariance scores. The level of structural conservation is measured by the number of covarying mutations on the tree. Our assumption is that more covarying mutations on the tree mean stronger evidence of conservation. In addition, PhyloRNAalifold does not discard the original scoring scheme of RNAalifold, because experimental results showed that this scheme can infer significant RNA structural aspects with high sensitivity and selectivity [38]. Assume $m_{i,j}$ covarying mutations occur between the ith and jth columns on the alignment's phylogenetic tree, and that the number of base-pairs on those columns is $b_{i,j}$. The value of $m_{i,j}$ depends on the size of the alignment. Since our approach focuses on improving the bonus part of the covariance scores, the number of covarying mutations is normalized by its upper bound. A new factor for the conservation score is proposed:
(5)
where β is the scale parameter for the normalized covarying mutation numbers. PhyloRNAalifold computes covariance scores with the following formula:
(6)
All other parameters and their default values in RNAalifold are retained. Because $\epsilon_{i,j} \geq 1$, any two columns whose covariance score in RNAalifold is greater than the threshold $\gamma_t$ (default value −2) are also marked as pairing in PhyloRNAalifold. The advantage of PhyloRNAalifold is therefore that it admits additional potentially pairing positions that have high mutation numbers.
Computing the number of covarying mutations
Given a phylogenetic tree and labels at its leaves, the Fitch algorithm optimizes the nucleotide assignment of the internal nodes to minimize the number of mutations [39]. If we model the phylogeny as a maximum parsimony problem, this number can be taken as the actual number of mutations. The Fitch algorithm consists of a forward phase and a backward phase. In the forward phase, all possible labels at each internal node are inferred, and the number of mutations is computed during a bottom-up traversal. In the backward phase, a top-down pass is performed to find the optimal label at each internal node. Only the forward phase is applied in PhyloRNAalifold, since we do not need the exact labels at the internal nodes, but only the number of mutations on the tree. Without loss of generality, we require the tree to be a rooted binary tree. r denotes the root of the tree, and $v$, $v_l$, $v_r$ denote a node, the left child of v, and the right child of v, respectively. $F(v)$ is the set of possible labels at node v, and $cost(v)$ is the number of mutations on the sub-tree rooted at v. The forward phase can then be described with the following recursive functions:
(7)
For each leaf, $F(v)$ is the base at the corresponding sequence. After the computation is finished, $cost(r)$ gives the minimum number of mutations on the phylogenetic tree. The optimality of this algorithm was proved in [40].
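The forward phase can be sketched as follows, using the standard textbook form of Fitch's rule: take the intersection of the children's label sets when it is non-empty, otherwise take their union and count one mutation. The `Node` class and the function name `fitch_forward` are illustrative assumptions, not part of PhyloRNAalifold.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Node of a rooted binary tree (hypothetical container for this sketch)."""
    label: Optional[str] = None   # base at a leaf; None for internal nodes
    left: Optional[Node] = None
    right: Optional[Node] = None


def fitch_forward(v: Node) -> tuple[set[str], int]:
    """Return (F(v), cost(v)) for the sub-tree rooted at v (bottom-up pass)."""
    if v.left is None and v.right is None:
        # Leaf: the label set is the observed base, with no mutations.
        return {v.label}, 0
    f_l, c_l = fitch_forward(v.left)
    f_r, c_r = fitch_forward(v.right)
    common = f_l & f_r
    if common:
        # Children agree on at least one label: no extra mutation needed.
        return common, c_l + c_r
    # Children disagree: keep all candidates and count one mutation.
    return f_l | f_r, c_l + c_r + 1


if __name__ == "__main__":
    tree = Node(left=Node(left=Node("A"), right=Node("A")),
                right=Node(left=Node("G"), right=Node("A")))
    labels, cost = fitch_forward(tree)
    print(sorted(labels), cost)
```

Only the pair (label set, cost) is propagated upward, which matches the observation that the backward phase is unnecessary when only the mutation count is needed.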
In Equation 5, the computation of $\epsilon_{i,j}$ does not depend on non-pairing bases. Therefore, in the revised Fitch algorithm, non-pairing bases need not be considered when the number of covarying mutations is computed. We changed the original Fitch algorithm in two ways: (1) at any leaf node, if the two bases at the leaf cannot form a base-pair, set its label to (‘-’, ‘-’); (2) for an internal node v, if the label at $v_l$ ($v_r$) is (‘-’, ‘-’), v obtains $F(v_r)$ ($F(v_l)$) as its label. One example of this algorithm is shown in Figure 1(d). The revised Fitch algorithm can be described using the following functions:
(8)
It is easy to see that our algorithm is optimal, because it only excludes non-pairing bases from the computation of the original Fitch algorithm.
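The two modifications can be sketched as below. This is an illustration under stated assumptions rather than the PhyloRNAalifold implementation: each leaf label is taken to be the two-symbol string at columns i and j, any change of label between pairing states is counted as one covarying mutation, and the string `"--"` stands for the (‘-’, ‘-’) label. The names `Node`, `GAP`, and `revised_fitch` are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

BASE_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}
GAP = "--"  # stands for the ('-', '-') label of a non-pairing leaf


@dataclass
class Node:
    """Node of a rooted binary tree (hypothetical container for this sketch)."""
    pair: Optional[str] = None    # symbols at columns i and j for a leaf
    left: Optional[Node] = None
    right: Optional[Node] = None


def revised_fitch(v: Node) -> tuple[set[str], int]:
    """Return (F(v), cost(v)), ignoring leaves that do not form a base-pair."""
    if v.left is None and v.right is None:
        # Modification (1): non-pairing leaves are relabelled as GAP.
        label = v.pair if v.pair in BASE_PAIRS else GAP
        return {label}, 0
    f_l, c_l = revised_fitch(v.left)
    f_r, c_r = revised_fitch(v.right)
    # Modification (2): a GAP child passes the sibling's labels up unchanged.
    if f_l == {GAP}:
        return f_r, c_l + c_r
    if f_r == {GAP}:
        return f_l, c_l + c_r
    common = f_l & f_r
    if common:
        return common, c_l + c_r
    return f_l | f_r, c_l + c_r + 1


if __name__ == "__main__":
    tree = Node(left=Node(left=Node("GC"), right=Node("AA")),
                right=Node(left=Node("AU"), right=Node("GC")))
    labels, cost = revised_fitch(tree)
    print(sorted(labels), cost)
```

A sub-tree whose label set is `{GAP}` always has cost 0 in this sketch, so skipping it never hides a counted mutation.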
In PhyloRNAalifold, the tree structure is an input variable, and users can apply any phylogenetic tree construction algorithm to build it. The time complexity of the original RNAalifold algorithm is $O(m \times n^2 + n^3)$ [41], where n is the length of the alignment and m is the number of sequences in the alignment. The extra computation in PhyloRNAalifold is caused by the revised Fitch algorithm, whose time complexity ranges from $O(\log m)$ to $O(m)$. In addition, PhyloRNAalifold needs to compute $\epsilon_{i,j}$ for each pair of columns in the alignment. Thus the overall time consumption of the revised Fitch algorithm is between $O(\log m \times n^2)$ and $O(m \times n^2)$. In neither case does it increase the time complexity of RNAalifold.