 Methodology article
 Open access
 Published:
Effective alignment of RNA pseudoknot structures using partition function posterior logodds scores
BMC Bioinformatics volumeÂ 16, ArticleÂ number:Â 39 (2015)
Abstract
Background
RNA pseudoknots play important roles in many biological processes. Previous methods for comparative pseudoknot analysis mainly focus on simultaneous folding and alignment of RNA sequences. Little work has been done to align two known RNA secondary structures with pseudoknots taking into account both sequence and structure information of the two RNAs.
Results
In this article we present a novel method for aligning two known RNA secondary structures with pseudoknots. We adopt the partition function methodology to calculate the posterior logodds scores of the alignments between bases or base pairs of the two RNAs with a dynamic programming algorithm. The posterior logodds scores are then used to calculate the expected accuracy of an alignment between the RNAs. The goal is to find an optimal alignment with the maximum expected accuracy. We present a heuristic to achieve this goal. The performance of our method is investigated and compared with existing tools for RNA structure alignment. An extension of the method to multiple alignment of pseudoknot structures is also discussed.
Conclusions
The method described here has been implemented in a tool named RKalign, which is freely accessible on the Internet. As more and more pseudoknots are revealed, collected and stored in public databases, we anticipate a tool like RKalign will play a significant role in data comparison, annotation, analysis, and retrieval in these databases.
Background
RNA pseudoknots are formed by pairing bases on singlestranded loops, such as hairpin and internal loops, with bases outside the loops [1,2]. They are often mingled with other RNA tertiary motifs [3], and are also found in noncoding RNAs [4,5]. RNA pseudoknots, with diverse functions [6,7], play important roles in many biological processes [8,9]; for example, they are required for telomerase activity [7], and have been shown to regulate the efficiency of ribosomal frameshifting in viruses [10].
Analysis and detection of RNA pseudoknots has been an active area of research. Many published articles in this area were focused on pseudoknot alignment [1114]. In this paper, we present a new approach, called RKalign, for RNA pseudoknot alignment. RKalign accepts as input two pseudoknotted RNAs where each RNA has both sequence data (i.e. nucleotides or bases) and structure data (i.e. base pairs), and produces as output an alignment between the two pseudoknotted RNAs. The structure data of a pseudoknotted RNA can be obtained from the literature or public databases [1518].
RKalign adopts the partition function methodology to calculate the posterior probabilities or logodds scores of structural alignments. The idea of using posterior probabilities to align biomolecules originated from [19,20] where the partition function methodology was employed to calculate the posterior probabilities of protein sequence alignments. Similar techniques were proposed by Do et al. [21] where the authors used hidden Markov models (HMMs) to calculate the posterior probabilities. Will et al. [22] extended the idea of [1921] to structurebased multiple RNA alignment where the authors calculated partition functions inside and outside of subsequence pairs on two pseudoknotfree RNAs. Here, we further extend this idea to pseudoknot alignment.
Several tools are available for RNA sequencestructure alignment [2325]. These tools do not deal with pseudoknots. Mohl et al. [26] proposed a method to perform sequencestructure alignment for RNA pseudoknots. The authors set up a pipeline for combining alignment and prediction of pseudoknots, and showed experimentally the effectiveness of this pipeline in pseudoknot structure annotation. Han et al. [27] decomposed embedded pseudoknots into simple pseudoknots and aligned them recursively. Yoon [28] used a profileHMM to establish sequence alignment constraints, and incorporated these constraints into an algorithm for aligning RNAs with pseudoknots. Wong et al. [29] identified the pseudoknot type of a given structure and developed dynamic programming algorithms for structural alignments of different pseudoknot types. Huang et al. [4] applied a tree decomposition algorithm to search for noncoding RNA pseudoknot structures in genomes.
The above methods were concerned with aligning a pseudoknot structure with a sequence or genome. Through the alignment, the sequence is folded and its structure is predicted. Xu et al. [11] presented a different method, called RNA Sampler, which can simultaneously fold and align two or multiple RNA sequences considering pseudoknots without known structures. Similar techniques were implemented in DAFS [12] and SimulFold [13]. Additional methods can be found in the CompaRNA web server [30]. In contrast to these methods, which perform alignment and folding at the same time, RKalign aims to align two known RNA pseudoknot structures where the structures are obtained from existing databases [1517]. As more pseudoknot structures become available in these databases, a tool like RKalign will be useful in performing data analysis in the repositories.
There are two groups of algorithms which are also capable of aligning two known RNA structures. The first group is concerned with aligning two RNA threedimensional (3D) structures, possibly containing pseudoknots. Ferre et al. [31] presented a dynamic programming algorithm by taking into account nucleotide, dihedral angle and basepairing similarities. Capriotti and MartiRenom [32] developed a program to align two RNA 3D structures based on a unitvector rootmeansquare approach. Chang et al. [33] and Wang et al. [34] employed a structural alphabet of different nucleotide conformations to align RNA 3D structures. Hoksza and Svozil [35] developed a pairwise comparison method based on 3D similarity of generalized secondary structure units. Rahrig et al. [36] presented the R3D Align tool for performing global pairwise alignment of RNA 3D structures using local superpositions. He et al. [37] developed the RASS web server for comparing RNA 3D structures using both sequence and 3D structure information. The above methods and tools were mainly designed for aligning two RNA tertiary structures by considering their geometric properties and torsion angles. In contrast, RKalign is used to align two RNA secondary structures with pseudoknots.
The second group of algorithms is concerned with aligning two RNA secondary structures without pseudoknots. These algorithms employed general editdistance alignment [38] or tree matching techniques [3941]. Jiang et al. [42] developed an approximation algorithm for aligning a pseudoknotfree structure with a pseudoknotted structure. Our work differs from Jiang et al.â€™s work in that we focus on the alignment of two pseudoknotted structures. Furthermore, we use the partition function methodology whereas Jiang et al. adopted a general editdistance approach to the structural alignment.
The method that is most closely related to ours is an option offered by the CARNA tool [14]. Like RKalign, this option is able to accept two known RNA secondary structures with pseudoknots, and produce an alignment between the two RNA structures. This option employs constraint programming techniques with a branch and bound scheme. It gradually refines solutions until the best solution is found. To understand the relative performance of the two tools, we perform extensive experiments to compare RKalign with CARNA using different datasets.
Methods
In this section, we present algorithmic details of RKalign. To align two RNA pseudoknot structures A and B, we adopt the partition function methodology to calculate the posterior probabilities or logodds scores of the alignments between bases or base pairs in A and B respectively. After calculating the posterior logodds scores, we then compute the expected accuracy of an alignment between structure A and structure B. The goal is to find an optimal alignment between A and B where the alignment has the maximum expected accuracy. We will present a heuristic to achieve this goal.
Definitions and notation
Suppose (i, j) is a base pair of pseudoknot structure A and (p, q) is a base pair of pseudoknot structure B. We use score((i, j), (p, q)) to represent the score of aligning (i, j) with (p, q) where the score is obtained from the logodds RIBOSUM matrix [43]. The use of this scoring matrix permits RKalign to determine the similarity between pseudoknot structures that contain compensatory base changes. With this scoring matrix, RKalign is able to handle noncanonical base pairs. Aligning a single base with a base pair is prohibited by RKalign.
Suppose structure A has m nucleotides, i.e. the length of A is m, and structure B has n nucleotides, i.e. the length of B is n. We use A[c _{1}, c _{2}] where 1â€‰â‰¤â€‰c _{1}â€‰â‰¤â€‰c _{2}â€‰â‰¤â€‰m to represent the portion of A that begins at position c _{1} and ends at position c _{2} inclusively. We use B[d _{1}, d _{2}] where 1â€‰â‰¤â€‰d _{1}â€‰â‰¤â€‰d _{2}â€‰â‰¤â€‰n to represent the portion of B that begins at position d _{1} and ends at position d _{2} inclusively. We use A[c] to represent the nucleotide and secondary structure at position c of A, and B[d] to represent the nucleotide and secondary structure at position d of B.
Partition function computation
Suppose (i, j)â€‰âˆˆâ€‰A is aligned with (p, q)â€‰âˆˆâ€‰B. Let Z _{ c,d } (\( {Z}_{c,\ d}^{\prime } \) respectively) represent the partition function of all alignments between A[1, c] (A[c, m] respectively) and B[1, d] (B[d, n] respectively). Let \( {Z^{\prime}}_{c,d}^{\prime } \) represent the partition function of all alignments between A[iâ€‰+â€‰1, c] and B[pâ€‰+â€‰1, d]. We focus on the case in which both (i, j) and (p, q) are base pairs. The case for aligning single bases is simpler, and thus omitted.
First, we show how to calculate Z _{ c,d } where 1 â‰¤â€‰c <â€‰i and 1 â‰¤â€‰d <â€‰p. There are three cases to be considered: (i) A[c] is aligned with B[d]; (ii) B[d] is aligned to a gap; and (iii) A[c] is aligned to a gap. Let \( {Z}_{c,d}^M \) represent the partition function of all alignments between A[1, c] and B[1, d] where A[c] is aligned with B[d]. Let \( {Z}_{c,d}^E \) represent the partition function of all alignments between A[1, c] and B[1, d] where B[d] is aligned to a gap. Let \( {Z}_{c,d}^F \) represent the partition function of all alignments between A[1, c] and B[1, d] where A[c] is aligned to a gap. Then Z _{ c,d } can be calculated by Equation (1).
We ignore and skip the computation of Z _{ c,d } when A[c] or B[d] is the left base of some base pair. If A[c] (B[d], respectively) is a single base and B[d] (A[c], respectively) is the right base of some base pair, \( {Z}_{c,d}^M=0 \). Otherwise, let A[c] be the right base of some base pair (x, c) and let B[d] be the right base of some base pair (y, d). Following [20], \( {Z}_{c,d}^M \) can be calculated by Equation (2).
Here T is a constant, and score((x, c), (y, d)) is obtained from the RIBOSUM8560 matrix [43]. Thus, the partition function \( {Z}_{c,d}^M \) can be computed recursively by dynamic programming as follows:
When calculating \( {Z}_{c,d}^E \), since B[d] is aligned to a gap, we know that A[c] must be aligned with B[d1]. Therefore,
where score(âˆ’,â€‰(y,â€‰d)) is the gap penalty value obtained by aligning base pair (y, d) to gaps. Thus,
When calculating \( {Z}_{c,d}^F \), since A[c] is aligned to a gap, B[d] must be aligned with A[c1]. Therefore,
where score((x, c),âˆ’) is the gap penalty value obtained by aligning base pair (x, c) to gaps. Thus,
Next, we show how to calculate \( {Z}_{c,\ d}^{\prime } \) where j <â€‰c â‰¤â€‰m and q <â€‰d â‰¤â€‰n. There are three cases to be considered: (i) A[c] is aligned with B[d]; (ii) B[d] is aligned to a gap; and (iii) A[c] is aligned to a gap. Let \( {Z^{\prime}}_{c,d}^M \) represent the partition function of all alignments between A[c, m] and B[d, n] where A[c] is aligned with B[d]. Let \( {Z^{\prime}}_{c,d}^E \) represent the partition function of all alignments between A[c, m] and B[d, n] where B[d] is aligned to a gap. Let \( {Z^{\prime}}_{c,d}^F \) represent the partition function of all alignments between A[c, m] and B[d, n] where A[c] is aligned to a gap. Then \( {Z}_{c,\ d}^{\prime } \) can be calculated by Equation (8).
We ignore the computation of \( {Z}_{c,\ d}^{\prime } \) when A[c] or B[d] is the right base of some base pair. If A[c] (B[d], respectively) is a single base and B[d] (A[c], respectively) is the left base of some base pair, \( {Z^{\prime}}_{c,d}^M=0 \). Otherwise, let A[c] be the left base of some base pair (c, x) and let B[d] be the left base of some base pair (d, y). Following [20], \( {Z^{\prime}}_{c,d}^M \) can be calculated by Equation (9).
When calculating \( {Z^{\prime}}_{c,d}^E \), since B[d] is aligned to a gap, A[c] must be aligned with B[dâ€‰+â€‰1]. Therefore,
When calculating \( {Z^{\prime}}_{c,d}^F \), since A[c] is aligned to a gap, B[d] must be aligned with A[câ€‰+â€‰1]. Therefore,
Finally, we show how to calculate \( {Z^{\prime}}_{c,d}^{\prime } \) where i <â€‰câ€‰<â€‰j and pâ€‰<â€‰dâ€‰<â€‰q. There are three cases to be considered: (i) A[c] is aligned with B[d]; (ii) B[d] is aligned to a gap; and (iii) A[c] is aligned to a gap. Let \( {{Z^{\prime}}^{\prime}}_{c,d}^M \) represent the partition function of all alignments between A[iâ€‰+â€‰1, c] and B[pâ€‰+â€‰1, d] where A[c] is aligned with B[d]. Let \( {{Z^{\prime}}^{\prime}}_{c,d}^E \) represent the partition function of all alignments between A[iâ€‰+â€‰1, c] and B[pâ€‰+â€‰1, d] where B[d] is aligned to a gap. Let \( {{Z^{\prime}}^{\prime}}_{c,d}^F \) represent the partition function of all alignments between A[iâ€‰+â€‰1, c] and B[pâ€‰+â€‰1, d] where A[c] is aligned to a gap. Then \( {Z^{\prime}}_{c,d}^{\prime } \) can be calculated by Equation (12).
We ignore the computation of \( {Z^{\prime}}_{c,d}^{\prime } \) when A[c] or B[d] is the left base of some base pair. If A[c] (B[d], respectively) is a single base and B[d] (A[c], respectively) is the right base of some base pair, \( {{Z^{\prime}}^{\prime}}_{c,d}^M=0 \). Otherwise, let A[c] be the right base of some base pair (x, c) and let B[d] be the right base of some base pair (y, d). If xâ€‰<â€‰iâ€‰+â€‰1 or yâ€‰<â€‰pâ€‰+â€‰1, we ignore and skip the computation of \( {Z^{\prime}}_{c,d}^{\prime } \) . We consider only the case where xâ€‰â‰¥ iâ€‰+â€‰1 and yâ€‰â‰¥ pâ€‰+â€‰1. Following [20], \( {{Z^{\prime}}^{\prime}}_{c,d}^M \) can be calculated by Equation (13).
When calculating \( {{Z^{\prime}}^{\prime}}_{c,d}^E \), since B[d] is aligned to a gap, A[c] must be aligned with B[d1]. Therefore,
When calculating \( {{Z^{\prime}}^{\prime}}_{c,d}^F \), since A[c] is aligned to a gap, B[d] must be aligned with A[c1]. Therefore,
Calculation of posterior logodds scores
There are four cases to be considered when calculating the posterior probability or logodds score of aligning base pair (i, j) of structure A with base pair (p, q) of structure B, denoted by Prob((i, j)â€‰~â€‰(p, q)).
Case 1
Base pair (i, j) doesnâ€™t cross another base pair and (p, q) doesnâ€™t cross another base pair. That is, for any base pair (u, v), iâ€‰<â€‰vâ€‰<â€‰j if and only if iâ€‰<â€‰uâ€‰<â€‰j. Furthermore, for any base pair (x, y), pâ€‰<â€‰yâ€‰<â€‰q if and only if pâ€‰<â€‰xâ€‰<â€‰q. Consequently, the alignment between structure A and structure B can be divided into the following three parts: (i) the alignment between A[1, i â€“ 1] and B[1, p â€“ 1]; (ii) the alignment between A[iâ€‰+â€‰1, j â€“ 1] and B[pâ€‰+â€‰1, q â€“ 1]; and (iii) the alignment between A[jâ€‰+â€‰1, m] and B[qâ€‰+â€‰1, n]. Following [20] we get
Case 2
Base pair (i, j) crosses another base pair whereas (p, q) doesnâ€™t cross another base pair. That is, there exists a base pair (u, v) in A such that (i) iâ€‰<â€‰vâ€‰<â€‰j and uâ€‰<â€‰i, or (ii) iâ€‰<â€‰uâ€‰<â€‰j and vâ€‰>â€‰j. Furthermore, for any base pair (x, y), pâ€‰<â€‰yâ€‰<â€‰q if and only if pâ€‰<â€‰xâ€‰<â€‰q. In this case, (i, j) crosses (u, v), which forms a pseudoknot in structure A, while (p, q) doesnâ€™t form a pseudoknot in structure B.
When (i) is true, since uâ€‰<â€‰i, we have 1â€‰â‰¤â€‰uâ€‰â‰¤â€‰iâ€‰âˆ’â€‰1. Furthermore, since vâ€‰>â€‰iâ€‰>â€‰iâ€‰âˆ’â€‰1, (u, v) is ignored when calculating Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1} in Equation (16). In addition, since uâ€‰<â€‰iâ€‰<â€‰iâ€‰+â€‰1, (u, v) is ignored when calculating \( {Z^{\prime}}_{j1,q1}^{\prime } \) in Equation (16). Base pair (u, v) will be considered when calculating Prob ((u, v) ~ (p, q)). Thus, our algorithm doesnâ€™t miss the calculation of the posterior logodds score of aligning any two base pairs from structure A and structure B respectively.
When (ii) is true, since vâ€‰>â€‰j, we have jâ€‰+â€‰1â€‰â‰¤ vâ€‰â‰¤â€‰m. Furthermore, since uâ€‰<â€‰jâ€‰<â€‰jâ€‰+â€‰1, (u, v) is ignored when calculating \( {Z}_{j+1,q+1}^{\prime } \) in Equation (16). Base pair (u, v) will be considered when calculating Prob ((u, v) ~ (p, q)).
Case 3
Base pair (p, q) crosses another base pair whereas (i, j) doesnâ€™t cross another base pair. This case is similar to Case 2 above.
Case 4
Base pair (i,j) crosses another base pair and (p,q) also crosses another base pair. That is, there exists a base pair (u,v) in A such that (i) iâ€‰<â€‰vâ€‰<â€‰j and uâ€‰<â€‰i, or (ii) iâ€‰<â€‰uâ€‰<â€‰j and vâ€‰>â€‰j. Furthermore, there exists a base pair (x, y) in B such that (iii) pâ€‰<â€‰yâ€‰<â€‰q and xâ€‰<â€‰p, or (iv) pâ€‰<â€‰xâ€‰<â€‰q and yâ€‰>â€‰q. In this case, (i,j) crosses (u,v), which forms a pseudoknot in structure A. Furthermore (p,q) crosses (x,y), which also forms a pseudoknot in structure B.
When (i) and (iii) are true, (u, v) is ignored when calculating Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1} and \( {Z^{\prime}}_{j1,q1}^{\prime } \) as discussed in Case 2 (i). Moreover, (x, y) is also ignored when calculating Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1} and \( {Z^{\prime}}_{j1,q1}^{\prime } \) (Case 3). When (i) and (iv) are true, (u, v) is ignored when calculating Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1} and \( {Z^{\prime}}_{j1,q1}^{\prime } \) (Case 2 (i)); (x, y) is also ignored when calculating \( {Z}_{j+1,q+1}^{\prime } \) (Case 3). When (ii) and (iii) are true, (u, v) is ignored when calculating Z ^{'} _{ jâ€‰+â€‰1,qâ€‰+â€‰1} (Case 2 (ii)); (x, y) is also ignored when calculating Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1} and \( {Z^{\prime}}_{j1,q1}^{\prime } \) (Case 3). When (ii) and (iv) are true, (u, v) is ignored when calculating \( {Z}_{j+1,q+1}^{\prime } \) (Case 2 (ii)); (x, y) is also ignored when calculating \( {Z}_{j+1,q+1}^{\prime } \) (Case 3).
When both (i, j) and (p, q) are single bases, i.e. i = j and p = q, the value of \( {Z^{\prime}}_{j1,q1}^{\prime } \) in Equation (16) is defined as 1, and we use the same formula in Equation (16) to calculate Prob((i, j)â€‰~â€‰(p, q)).
From the above discussions, Equation (16) can be used to calculate the posterior logodds score of aligning two bases or base pairs with a dynamic programming algorithm. Furthermore, the algorithm doesnâ€™t miss the calculation of the posterior logodds score of aligning any two bases or base pairs from structure A and structure B respectively.
Pairwise alignment
Let a _{ A,B } be an alignment between structure A and structure B. The expected accuracy of a _{ A,B,} denoted \( \overline{Accu}\left({a}_{A,B}\right) \), is defined as follows [21]:
where ((i, j)â€‰~â€‰(p, q)â€‰âˆˆâ€‰a _{ A,B }) means (i, j)â€‰âˆˆâ€‰A is aligned with (p, q)â€‰âˆˆâ€‰B in a _{ A,B }, Prob((i, j)â€‰~â€‰(p, q)) is the posterior logodds score of aligning (i, j)â€‰âˆˆâ€‰A with (p, q)â€‰âˆˆâ€‰B as defined in Equation (16), and h (k respectively) is the number of single bases plus the number of base pairs in A (B respectively).
An optimal alignment between structure A and structure B is an alignment with the maximum expected accuracy. We present here a heuristic to find a (sub)optimal alignment. From the previous subsection, we are able to construct the posterior logodds score matrix for aligning structure A with structure B where the matrix contains Prob((i, j)â€‰~â€‰(p, q)) for all (i, j)â€‰âˆˆâ€‰A and (p, q)â€‰âˆˆâ€‰B. Our heuristic is an iterative procedure. In the first step, we select two bases or base pairs with the largest score from this matrix to build the first alignment line between A and B where the alignment line connects the selected bases or base pairs. Then, we select the second largest score from the matrix to construct the next alignment line provided that the newly constructed alignment line satisfies the following two constraints:

(1)
A base (base pair, respectively) can be aligned with at most one base (base pair, respectively).

(2)
The newly constructed alignment lines do not cross the alignment lines built in the previous steps. Specifically, suppose (i, j) is aligned with (p, q) and (iâ€‰',â€‰jâ€‰') is aligned with (pâ€‰',â€‰qâ€‰'). The alignment lines between (i, j) and (p, q) do not cross the alignment lines between (iâ€‰',â€‰jâ€‰') and (pâ€‰',â€‰qâ€‰') if and only if the following conditions hold: (i) iâ€‰'â€‰<â€‰i iff p ^{'}â€‰<â€‰p, (ii) i <â€‰iâ€‰'â€‰<â€‰j iff p <â€‰pâ€‰'â€‰<â€‰q, (iii) iâ€‰'â€‰>â€‰j iff pâ€‰'â€‰>â€‰q, (iv) jâ€‰'â€‰<â€‰i iff qâ€‰'â€‰<â€‰p, (v) i <â€‰jâ€‰'â€‰<â€‰j iff p <â€‰qâ€‰'â€‰<â€‰q and (vi) jâ€‰'â€‰>â€‰j iff qâ€‰'â€‰>â€‰q.
If the newly constructed alignment line violates the above constraints, it is discarded. We repeat the above steps until the smallest posterior logodds score in the matrix is considered. If there are still bases or base pairs that are not aligned yet, these remaining bases or base pairs are aligned to gaps.
Time and space complexity
In calculating Prob((i, j)â€‰~â€‰(p, q)), we need to compute Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1}, \( {Z^{\prime}}_{j1,q1}^{\prime } \) and Z ^{'} _{ jâ€‰+â€‰1,qâ€‰+â€‰1}; cf. Equation (16). Computing Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1}, \( {Z^{\prime}}_{j1,q1}^{\prime } \) and \( {Z}_{j+1,q+1}^{\prime } \) requires O(mn) time. Since we need to calculate Prob((i, j)â€‰~â€‰(p, q)) for all (i, j)â€‰âˆˆâ€‰A and (p, q)â€‰âˆˆâ€‰B, the time complexity of the pairwise alignment algorithm is O(m ^{2} n ^{2}). At any moment, we maintain a twodimensional matrix for storing Z _{ iâ€‰âˆ’â€‰1,pâ€‰âˆ’â€‰1}, \( {Z^{\prime}}_{j1,q1}^{\prime } \) and \( {Z}_{j+1,q+1}^{\prime } \), which requires O(mn) space. Since the total number of bases and base pairs in structure A (B respectively) is at most m (n respectively), we use a twodimensional matrix to store Prob((i, j)â€‰~â€‰(p, q)), which also requires O(mn) space. Thus, the space complexity of the algorithm is O(mn). Notice that the time complexity derived here is a very pessimistic upper bound since in calculating the partition functions, some base pairs are ignored as described in the previous subsections. During our experiments, we tested over 200 alignments and the running times of our algorithm ranged from 16 ms to roughly 7 minutes, where the lengths of the aligned structures ranged from 22 nt to 1,553 nt.
Experimental design
Datasets
RKalign is implemented in Java. The program accepts as input two pseudoknotted RNAs where each RNA has both sequence data (i.e. nucleotides or bases) and structure data (i.e. base pairs), and produces as output an alignment between the two pseudoknotted RNAs. Popular benchmark datasets such as BRAliBase [44], RNase P [45] and Rfam [46] are not suitable for testing RKalign. The reason is that BRAliBase contains only sequence information, while RNase P and Rfam contain consensus structures of multiple sequence alignments rather than alignments of individual structures of RNAs. As a consequence, we manually created two datasets for testing RKalign and comparing it with related alignment methods.
The first dataset, denoted Dataset1, contains 38 RNA pseudoknot structures chosen from the PDB [16] and RNA STRAND [15] (see Additional file 1: Table S1). These RNAs were selected in such a way that they have a wide range of sequence lengths. Each threedimensional (3D) molecule in this dataset was taken from the PDB. The secondary structure of the 3D molecule was obtained with RNAview [47], retrieved from RNA STRAND. The second dataset, denoted Dataset2, contains 36 RNA pseudoknot structures chosen from PseudoBase [17,18] (see Additional file 1: Table S2). As in the first dataset, the RNA molecules in the second dataset have a wide range of sequence lengths. The pseudoknots in these datasets can be broadly classified into two types: Htype and recursive pseudoknots [8,29]. There are 12 Htype pseudoknots and 26 recursive pseudoknots in Dataset1. There are 22 Htype pseudoknots and 14 recursive pseudoknots in Dataset2.
Alignment quality
A good structural alignment tends to align a base pair with another base pair rather than with two single bases [35,36]. We therefore use the base_mismatch ratio to assess the quality of an alignment. A base mismatch occurs when a single base is aligned with the left or right base of a base pair or when a nucleotide is aligned to a gap. The base_mismatch ratio of an alignment a _{ A,B } between structure A and structure B is defined as the number of base mismatches in a _{ A,B } divided by the total number of alignment lines in a _{ A,B }, multiplied by 100%. Statistically significant performance differences between alignment methods are calculated using Wilcoxon signed rank tests [48], which are commonly used for comparing alignment programs [4951]. As in [4951] we consider pvalues below 0.05 to be statistically significant.
Results
We conducted a series of experiments to evaluate the performance of RKalign and compare it with related methods, where the performance measure used was the base_mismatch ratio. In the first experiment, we selected 106 pairs of RNA pseudoknot structures from Dataset1 and applied our method to aligning the two molecules in each pair. The two molecules in a pair belonged to the same pseudoknot type, as it is biologically meaningless to align RNA molecules that lack consensus [35,52]. The average base_mismatch ratio calculated by RKalign for the selected 106 pairs was 34.84%, compared to the average base_mismatch ratio, 78.53%, for all pairs of molecules in Dataset1.
In addition, we also ran CARNA [14], RNA Sampler [11], DAFS [12], R3D Align [53] and RASS [37] on the 106 pairs of molecules. The CARNA tool was chosen because an option of the tool is closely related to RKalign, both of which can align known pseudoknot structures. RNA Sampler and DAFS were chosen because they are widely used tools capable of simultaneously folding and aligning RNA sequences considering pseudoknots without known structures. When running these two tools, the structure information in Dataset1 was ignored and only the sequence data was used as the input of the tools. R3D Align and RASS were chosen because they are stateoftheart RNA 3D alignment programs; furthermore, like RKalign, R3D Align and RASS output the entire alignment of two RNA structures. Since R3D Align and RASS accept 3D structures as input whereas RKalign and CARNA accept bases and base pairs as input, we used the PDB files in Dataset1 as the input for R3D Align and RASS while using the corresponding RNA STRAND entries in Dataset1 as the input for RKalign and CARNA. The 106 pairwise alignments produced by RKalign can be found in Additional file 2.
Figure 1 presents histograms for the base_mismatch ratios of the six tools. Figure 2 presents boxplots for the base_mismatch ratios of the six tools. These figures show the distribution of the base_mismatch ratios for the six tools. RKalign and CARNA were not statistically different according to a Wilcoxon signed rank test (pâ€‰>â€‰0.05). On the other hand, they both were significantly better than the other four tools according to the Wilcoxon signed rank test (pâ€‰<â€‰0.05). It was observed that the structures predicted by RNA Sampler and DAFS might not be correct. Consequently, there were many base mismatches with respect to the known structures in the alignments.
For example, consider Figure 3, which shows the alignment result of DAFS, R3D Align and RKalign respectively on two pseudoknot structures with PDB IDs 1L2X and 1RNK. The base_mismatch ratio of DAFS (R3D Align, RKalign, respectively) is 57.14% (67.39%, 27.78%, respectively). Figure 3(a) shows the predicted common secondary structure and the alignment produced by DAFS. Figure 3(b) shows the known secondary structures of 1L2X and 1RNK and the alignment produced by DAFS where the known secondary structures are used to calculate the base_mismatch ratios. Figure 3(c) shows the alignment obtained from R3D Align and Figure 3(d) shows the alignment obtained from RKalign. It can be seen that the predicted common secondary structure in Figure 3(a) is quite different from the known secondary structure of 1L2X. Refer to Figure 3(b). The base G (G, C, C, A, A and A respectively) at position 1 (2, 8, 22, 23, 24 and 25 respectively) in 1L2X is a single base, which is aligned with the left or right base of some base pair in 1RNK, leading to base mismatches in the alignment. Similarly, the base G (A, C, A and U respectively) at position 7 (20, 21, 24 and 34 respectively) in 1RNK is a single base, which is aligned with the left or right base of some base pair in 1L2X. R3D Align doesnâ€™t align the pseudoknot structures well either, due to the fact that many gaps are involved in the alignment (Figure 3(c)). In this example, RKalign produces the best alignment (Figure 3(d)). It should be pointed out, however, that 3D alignment programs such as R3D Align are generalpurpose structure alignment tools capable of comparing two RNA 3D molecules with diverse tertiary motifs, whereas RKalign focuses on secondary structures with pseudoknots only.
In the second experiment, we compared RKalign, CARNA, RNA Sampler and DAFS using the RNA structures in Dataset2. As in the first experiment, we selected 124 pairs of molecules from Dataset2 where the two molecules in a pair belonged to the same pseudoknot type. The average base_mismatch ratio calculated by RKalign for the selected 124 pairs was 35.89%, compared to the average base_mismatch ratio, 81.56%, for all pairs of molecules in Dataset2. We applied each of the four tools to the molecules to produce 124 pairwise alignments. The 124 alignments produced by RKalign can be found in Additional file 3.
Figure 4 presents histograms for the base_mismatch ratios of the four tools. Figure 5 presents boxplots for the base_mismatch ratios of the four tools. These figures show the distribution of the base_mismatch ratios for the four tools. RKalign and CARNA were not statistically different (Wilcoxon signed rank test, pâ€‰>â€‰0.05); both tools were significantly better than RNA Sampler and DAFS (Wilcoxon signed rank test, pâ€‰<â€‰0.05).
Based on the above experimental results, there is no statistically significant difference between RKalign and CARNA in terms of base_mismatch ratios. As described in [4,54], a good pseudoknot alignment has many matched stems and few mismatched stems. In the last experiment, we further compared RKalign with CARNA by examining how they match stems in two pseudoknot structures A and B. A stem s _{ A } âˆˆ A is said to match a stem s _{ B } âˆˆ B if (i) s _{ A } ,s _{ B } are aligned together and they cannot be aligned with other stems, and (ii) for every base pair xâ€‰âˆˆâ€‰s _{ A } and base pair yâ€‰âˆˆ s _{ B }, a base of x is aligned with a base of y if and only if the other base of x is aligned with the other base of y; otherwise, there is a stem mismatch between s _{ A } and s _{ B } . The stem_mismatch ratio of an alignment a _{ A,B } between structure A and structure B is defined as (1 â€“ M) where M is the number of matched stems in a _{ A,B } divided by the total number of stems in A and B, multiplied by 100%.
Figure 6 shows the average stem_mismatch ratios of RKalign and CARNA obtained by running the tools on Dataset1 and Dataset2 respectively. RKalign was significantly better than CARNA (Wilcoxon signed rank test, pâ€‰<â€‰0.05). A close look at the alignment results of CARNA reveals why this happens. For instance, consider Figure 7(a), which shows how CARNA aligns the two PDB structures, 1L2X and 1RNK, given in Figure 3. Figure 7(b) illustrates mismatched stems in the alignment in Figure 7(a). Figure 7(c) shows the alignment of the same molecules, 1L2X and 1RNK, produced by RKalign where there is no stem mismatch. Refer to Figure 7(b). In 1L2X, there are two stems, highlighted in blue and red respectively. In 1RNK, there are also two stems, highlighted in blue and red respectively. In 1L2X, the base G at position 7 and the base C at position 14 form a base pair in its blue stem. In 1RNK, the base U at position 13 and the base G at position 28 form a base pair in its red stem.
Now, observe that the base C at position 14 in 1L2X is aligned with the base U at position 13 in 1RNK, but the base G at position 7 in 1L2X is not aligned with the base G at position 28 in 1RNK; instead the base G at position 7 in 1L2X is aligned with the single base G at position 7 in 1RNK. Thus, there is a stem mismatch between the blue stem of 1L2X and the red stem of 1RNK, a situation that is not favored when performing pseudoknot alignment [4,54]. This situation occurs more frequently in CARNA alignment results than in RKalign alignment results. As a consequence, CARNA has much higher stem_mismatch ratios than RKalign.
Comparing Figure 7(a) and Figure 7(c), we also note that the overall alignments produced by CARNA and RKalign are quite different. In Figure 7(a) in which the alignment from CARNA is shown, the base G at position 6 and the base C at position 15 form a base pair in 1L2X. The base A at position 6 in 1RNK is a single base. It can be seen that the base G at position 6 in 1L2X is aligned with the base A at position 6 in 1RNK, i.e., a base pair is aligned with a single base. In addition, in 1L2X the base C at position 14, which is the right base of a base pair, is aligned with the base U at position 13, which is the left base of a base pair in 1RNK. Aligning a base pair with a single base, and aligning the right base of a base pair with the left base of another base pair, occur in CARNAâ€™s output shown in Figure 7(a), but do not occur in RKalignâ€™s output shown in Figure 7(c). On the other hand, there are more gaps in RKalignâ€™s output than in CARNAâ€™s output; specifically there are 10 gaps in RKalignâ€™s output shown in Figure 7(c) compared to 8 gaps in CARNAâ€™s output shown in Figure 7(a).
Discussion and conclusions
In this paper, we present a novel method (RKalign) for comparing two known RNA pseudoknot structures. The method adopts the partition function methodology to calculate the posterior logodds scores of the alignments between bases or base pairs of the RNAs with a dynamic programming algorithm. The posterior logodds scores are then used to calculate the expected accuracy of an alignment between the RNAs. The goal is to find an optimal alignment with the maximum expected accuracy. We present a heuristic to achieve this goal. Experimental results demonstrate the good performance of the proposed RKalign method.
New pseudoknotted structures are found periodically, as exemplified by the recently determined ribosomal CCR5 frameshift pseudoknot [55] and the translational enhancer structures found in the 3â€² UTRs of plant viruses [5659]. It is therefore important to be able to compare these new structures to a database of known pseudoknots to determine the possibility of similar functionality. For example, some of the recently functionally similar pseudoknots found in the 3â€² UTRs of plant viruses have been shown to act as translational enhancers and have 3D structures that are similar to tRNAs. Importantly they contain pseudoknots that produce tRNAlike 3D folds, but are not derived from the standard tRNA secondary structure cloverleaf. In addition, these elements have been shown to be important for ribosome binding. RKalign will be useful in performing this kind of database search for structurefunction analysis of pseudoknots.
Extension to multiple alignment
Our pairwise alignment method can be extended to align multiple RNA pseudoknot structures by utilizing a guide tree. Specifically, we treat each structure as a cluster and use the expected accuracy defined in Equation (17) as the measure to determine the similarity of two structures or clusters. Initially, we merge two RNA structures that are most similar into one cluster. Subsequently we merge two clusters that are most similar into a larger cluster using the agglomerative hierarchical clustering algorithm [60], where the similarity of two clusters is calculated by the average linkage algorithm [60].
An alignment of two clusters is actually an alignment of two profiles, where each cluster is treated as a profile. Initially, each profile contains a single RNA pseudoknot structure. As the guide tree grows, a profile may contain multiple RNA pseudoknot structures; more precisely, the profile is a multiple alignment of these RNA structures. A single base of a profile is a column of the profile where the column contains single bases or gaps; a base pair of a profile includes two columns of the profile where the left column contains left bases or gaps and the right column contains corresponding right bases or gaps, and left bases and corresponding right bases form base pairs.
Suppose we want to align profile A' and profile B', which amounts to aligning two multiple alignments. Let R (S respectively) be an RNA pseudoknot structure in profile A' (B' respectively) and let (i, j) ((p, q) respectively) be a base pair of R (S respectively). Let (iâ€‰', jâ€‰') represent a base pair of profile A' and let (pâ€‰', qâ€‰') represent a base pair of profile B'. We use (i, j) âˆˆ (iâ€‰', jâ€‰') ((p, q)â€‰âˆˆâ€‰(pâ€‰', qâ€‰') respectively) to represent that (i, j) ((p, q) respectively) occurs in the column(s) of base pair (iâ€‰', jâ€‰') ((pâ€‰', qâ€‰') respectively) of profile A' (B' respectively). Equation (18) below shows how to calculate Prob ^{'}((i ^{'}, j ^{'})â€‰~â€‰(p ^{'}, q ^{'})), which represents the transformed probability of aligning base pair (iâ€‰', jâ€‰') of profile A' with base pair (pâ€‰', qâ€‰') of profile B'.
Here, Prob((i, j)â€‰~â€‰(p, q)) is defined in Equation (16), A' represents the number of RNA pseudoknot structures in profile or cluster A', and B' represents the number of RNA pseudoknot structures in profile or cluster B'.
The multiple alignment algorithm can now be summarized as follows. The input of the algorithm is a set SS of RNA pseudoknot structures. For every two structures A and B in SS, we calculate their posterior logodds score matrix as described in the â€˜Calculation of posterior logodds scoresâ€™ subsection. After all the posterior logodds score matrices are calculated, we compute the expected accuracy \( \overline{Accu}\left({a}_{A,B}\right) \) as defined in Equation (17) where a _{ A,B } is a (sub)optimal alignment, found by the heuristic described in the â€˜Pairwise alignmentâ€™ subsection, between structure A and structure B. We use the expected accuracy or similarity values to construct the guide tree for the set SS, to determine the order in which two structures or profiles are aligned. To align two profiles A' and B', we use the same heuristic as described in the â€˜Pairwise alignmentâ€™ subsection, with the transformed probabilities Probâ€‰'â€‰((iâ€‰',â€‰jâ€‰') ~ (pâ€‰',â€‰qâ€‰')) defined in Equation (18) replacing the posterior probabilities Prob((i, j)â€‰~â€‰(p, q)) of structures A and B defined in Equation (16). The time complexity of this multiple alignment algorithm is O(k ^{2} n ^{4}) where k is the number of structures in the alignment and n is the maximum of the lengths of the structures; the space complexity of the algorithm is O(k ^{2} n ^{2}).
We tested our algorithm by selecting 30 groups each having 3, 4, or 5 pseudoknot structures of the same type from the datasets used in this study, and by performing multiple alignment in each group. The multiple alignments produced by our algorithm can be found in Additional file 4. We then compared our algorithm with three related methods: CARNA [14], RNA Sampler [11] and DAFS [12]. The base_mismatch ratio of a multiple alignment MA is defined as the sum of base_mismatch ratios of all pairs of structures in MA divided by the total number of structure pairs in MA, multiplied by 100%. The average base_mismatch ratio of RKalign (CARNA, RNA Sampler, DAFS, respectively) was 26.01% (25.79%, 32.15%, 29.23% respectively). RKalign and CARNA were not statistically different (Wilcoxon signed rank test, pâ€‰>â€‰0.05); the two methods were significantly better than RNA Sampler and DAFS (Wilcoxon signed rank test, pâ€‰<â€‰0.05).
Both our pairwise alignment and multiple alignment programs are available in the RKalign tool. This tool is capable of accepting as input pseudoknotted RNAs with both sequence (nucleotides or bases) and structure data (base pairs), and producing as output an alignment between the pseudoknotted RNAs. As more and more pseudoknots are revealed, collected and stored in public databases, we anticipate a tool like RKalign will play a significant role in data comparison, annotation, analysis, and retrieval in these databases.
Comparison with related methods
RKalign is designed to align known RNA pseudoknot structures. A different approach is to simultaneously fold and align RNA sequences without known structures, as adopted by several existing tools [1113]. When the structure information is not available, this simultaneous folding and alignment approach is the best. However, when pseudoknot structures already exist, RKalign performs significantly better than the existing tools, as observed in our experiments. The reason is that the structures predicted by these tools may not be correct. As a consequence, there are many base mismatches with respect to the known structures in the resulting alignments.
Pseudoknots are part of RNA tertiary motifs [2]. There are 3D alignment programs that can compare RNA tertiary structures including pseudoknots [36,37]. These programs consider the entire RNA 3D structure as a whole, and accept PDB files with 3D coordinates as input. As shown in our experiments, when considering and aligning secondary structures with pseudoknots, RKalign outperforms the 3D alignment programs. It should be noted, however, that the 3D alignment programs are generalpurpose structure alignment tools capable of comparing two RNA 3D molecules with diverse tertiary motifs, whereas RKalign deals with secondary structures with pseudoknots only.
While the work reported here focuses on pseudoknot alignment, it can also be applied to RNA secondary structures without pseudoknots. We applied RKalign to 102 pairs of pseudoknotfree structures taken from RNA STARND where the pseudoknotfree structures belonged to Rfam [46] (see Additional file 1: Table S3). We compared RKalign with three other tools: CARNA [14], RNAforester [41] and RSmatch [40]. RNAforester, included in the widely used Vienna RNA package [61], is a versatile RNA structure alignment tool. Like RKalign and CARNA, an option of RNAforester is able to accept as input two RNA molecules with both sequence data (nucleotides or bases) and secondary structure data (base pairs), and produce as output the global alignment of the two molecules. However, a limitation of RNAforester is that the aligned secondary structures cannot contain pseudoknots. RSmatch is similar to RNAforester, sharing the same limitation. Our experimental results showed that the average base_mismatch ratio for RKalign (CARNA, RNAforester, RSmatch, respectively) was 43.52% (42.27%, 35.11%, 39.66%, respectively), indicating RNAforester performed the best. These results are understandable, considering that RKalign is mainly designed for comparing complex pseudoknot structures whereas RNAforester focuses on simpler pseudoknotfree structures.
The work that is most closely related to RKalign is CARNA [14]. Both methods are able to accept as input known pseudoknot structures and produce as output an alignment of the known structures. Our experimental results indicated that the two methods perform well in terms of base_mismatch ratios, though RKalign yields much lower stem_mismatch ratios. It should be pointed out, however, that the comparison with CARNA is not completely fair. The input data of RKalign are restricted to fixed structures, which are structures used in this study. Using CARNA with fixed structures is more or less a misuse of the tool. The main purpose of CARNA is to align dotplots, and its scoring is optimized for that data format. Thus, when dotplots are considered, one should use CARNA. When fixed structures are considered, RKalign is recommended.
Availability
The latest version of RKalign can be downloaded at: http://bioinformatics.njit.edu/RKalign.
References
Pleij CWA, Rietveld K, Bosch L. A new principle of RNA folding based on pseudoknotting. Nucleic Acids Res. 1985;13(5):1717â€“31.
Xin Y, Laing C, Leontis NB, Schlick T. Annotation of tertiary interactions in RNA structures reveals variations and correlations. RNA. 2008;14(12):2465â€“77.
Laing C, Wen D, Wang JTL, Schlick T. Predicting coaxial helical stacking in RNA junctions. Nucleic Acids Res. 2012;40(2):487â€“98.
Huang Z, Wu Y, Robertson J, Feng L, Malmberg R, Cai L. Fast and accurate search for noncoding RNA pseudoknot structures in genomes. Bioinformatics. 2008;24(20):2281â€“7.
GriffithsJones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003;31(1):439â€“41.
Adams PL, Stahley MR, Kosek AB, Wang J, Strobel SA. Crystal structure of a selfsplicing group I intron with both exons. Nature. 2004;430(6995):45â€“50.
Theimer CA, Blois CA, Feigon J. Structure of the human telomerase RNA pseudoknot reveals conserved tertiary interactions essential for function. Mol Cell. 2005;17(5):671â€“82.
Staple DW, Butcher SE. Pseudoknots: RNA structures with diverse functions. PLoS Biol. 2005;3(6):e213.
Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME. Topology and prediction of RNA pseudoknots. Bioinformatics. 2011;27(8):1076â€“85.
Nixon PL, Rangan A, Kim YG, Rich A, Hoffman DW, Hennig M, et al. Solution structure of a luteoviral P1P2 frameshifting mRNA pseudoknot. J Mol Biol. 2002;322(3):621â€“33.
Xu X, Ji Y, Stormo GD. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics. 2007;23(15):1883â€“91.
Sato K, Kato Y, Akutsu T, Asai K, Sakakibara Y. DAFS: simultaneous aligning and folding of RNA sequences via dual decomposition. Bioinformatics. 2012;28(24):3218â€“24.
Meyer IM, Miklos I. SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput Biol. 2007;3(8):e149.
Sorescu DA, Mohl M, Mann M, Backofen R, Will S. CARNAalignment of RNA structure ensembles. Nucleic Acids Res. 2012;40(Web Server issue):W49â€“53.
Andronescu M, Bereg V, Hoos HH, Condon A. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics. 2008;9:340.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The protein data bank. Nucleic Acids Res. 2000;28(1):235â€“42.
van Batenburg FHD, Gultyaev AP, Pleij CWA, Ng J, Oliehoek J. PseudoBase: a database with RNA pseudoknots. Nucleic Acids Res. 2000;28(1):201â€“4.
Taufer M, Licon A, Araiza R, Mireles D, van Batenburg FH, Gultyaev AP, et al. PseudoBase++: an extension of PseudoBase for easy searching, formatting and visualization of pseudoknots. Nucleic Acids Res. 2009;37(Database issue):D127â€“35.
Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1995;8(10):999â€“1009.
Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006;22(22):2715â€“21.
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistencybased multiple sequence alignment. Genome Res. 2005;15(2):330â€“40.
Will S, Joshi T, Hofacker IL, Stadler PF, Backofen R. LocARNAP: accurate boundary prediction and improved detection of structural RNAs. RNA. 2012;18(5):900â€“14.
Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002;317(2):191â€“203.
Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23(8):926â€“32.
Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring noncoding RNA families and classes by means of genomescale structurebased clustering. PLoS Comput Biol. 2007;3(4):e65.
Mohl M, Will S, Backofen R. Lifting prediction to alignment of RNA pseudoknots. J Comput Biol. 2010;17(3):429â€“42.
Han B, Dost B, Bafna V, Zhang S. Structural alignment of pseudoknotted RNA. J Comput Biol. 2008;15(5):489â€“504.
Yoon BJ. Efficient alignment of RNAs with pseudoknots using sequence alignment constrains. EURASIP J Bioinform Syst Biol. 2009;2009:491074.
Wong TKF, Wan KL, Hsu BY, Cheung BW, Hon WK, Lam TW, et al. RNASAlign: RNA structural alignment system. Bioinformatics. 2011;27(15):2151â€“2.
Puton T, Kozlowski LP, Rother KM, Bujnicki JM. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res. 2013;41(7):4307â€“23.
Ferre F, Ponty Y, Lorenz WA, Clote P. DIAL: a web server for the pairwise alignment of two RNA threedimensional structures using nucleotide, dihedral angle and basepairing similarities. Nucleic Acids Res. 2007;35:W659â€“68.
Capriotti E, MartiRenom MA. SARA: a server for function annotation of RNA structures. Nucleic Acids Res. 2009;37:W260â€“5.
Chang YF, Huang YL, Lu CL. SARSA: a web tool for structural alignment of RNA using a structural alphabet. Nucleic Acids Res. 2008;36:W19â€“24.
Wang CW, Chen KT, Lu CL. iPARTS: an improved tool of pairwise alignment of RNA tertiary structures. Nucleic Acids Res. 2010;38:W340â€“7.
Hoksza D, Svozil D. Efficient RNA pairwise structure comparison by SETTER method. Bioinformatics. 2012;28(14):1858â€“64.
Rahrig RR, Leontis NB, Zirbel CL. R3D Align: global pairwise alignment of RNA 3D structures using local superpositions. Bioinformatics. 2010;26(21):2689â€“97.
He G, Steppi A, Laborde J, Srivastava A, Zhao P, Zhang J. RASS: a web server for RNA alignment in the joint sequencestructure space. Nucleic Acids Res. 2014;42(Web Server issue):W377â€“81.
Zhong C, Zhang S. Efficient alignment of RNA secondary structures using sparse dynamic programming. BMC Bioinformatics. 2013;14:269.
Shapiro BA, Zhang K. Comparing multiple RNA secondary structures using tree comparisons. Comput Appl Biosci. 1990;6(4):309â€“18.
Liu J, Wang JTL, Hu J, Tian B. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics. 2005;6:89.
HÃ¶chsmann M, Voss B, Giegerich R. Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Biol Bioinform. 2004;1(1):53â€“62.
Jiang T, Lin G, Ma B, Zhang K. A general edit distance between RNA structures. J Comput Biol. 2002;9(2):371â€“88.
Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44.
Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33(8):2433â€“9.
Brown JW. The Ribonuclease P Database. Nucleic Acids Res. 1999;27(1):314.
Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, et al. Rfam 11.0: 10Â years of RNA families. Nucleic Acids Res. 2013;41(Database issue):D226â€“32.
Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, et al. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003;31(13):3450â€“60.
Wilcoxon F. Probability table for individual comparisons by ranking methods. Biometrics. 1947;3(3):119â€“22.
Wilm A, Mainz I, Steger G. An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol Biol. 2006;1:19.
Nuin PA, Wang Z, Tillier ER. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics. 2006;7:471.
Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27(13):2682â€“90.
Bremges A, Schirmer S, Giegerich R. Finetuning structural RNA alignments in the twilight zone. BMC Bioinformatics. 2010;11:222.
Rahrig RR, Petrov AI, Leontis NB, Zirbel CL. R3D Align web server for global nucleotide to nucleotide alignments of RNA 3D structures. Nucleic Acids Res. 2013;41:W15â€“21.
Song Y, Liu C, Malmberg R, Pan F, Cai L. Tree decomposition based fast search of RNA structures including pseudoknots in genomes. Proc IEEE Comput Syst Bioinform Conf. 2005;1:223â€“34.
Belew AT, Meskauskas A, Musalgaonkar S, Advani VM, Sulima SO, Kasprzak WK, et al. Ribosomal frameshifting in the CCR5 mRNA is regulated by miRNAs and the NMD pathway. Nature. 2014;512(7514):265â€“9.
Gao F, Kasprzak WK, Szarko C, Shapiro BA, Simon AE. The 3' untranslated region of Pea Enation Mosaic Virus contains two Tshaped, ribosomebinding, capindependent translation enhancers. J Virol. 2014;88(20):11696â€“712.
Gao F, Kasprzak W, Stupina VA, Shapiro BA, Simon AE. A ribosomebinding, 3' translational enhancer has a Tshaped structure and engages in a longdistance RNARNA interaction. J Virol. 2012;86(18):9828â€“42.
Stupina VA, Meskauskas A, McCormack JC, Yingling YG, Shapiro BA, Dinman JD, et al. The 3' proximal translational enhancer of Turnip crinkle virus binds to 60S ribosomal subunits. RNA. 2008;14(11):2379â€“93.
McCormack JC, Yuan X, Yingling YG, Kasprzak W, Zamora RE, Shapiro BA, et al. Structural domains within the 3' untranslated region of Turnip crinkle virus. J Virol. 2008;82(17):8706â€“20.
Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. Waltham, Massachusetts: Morgan Kaufmann Publishers; 2011.
Hofacker IL. Vienna RNA, secondary structure server. Nucleic Acids Res. 2003;31(13):3429â€“31.
Acknowledgements
We are grateful to the handling editor and anonymous reviewers for their constructive suggestions, which helped improve both the presentation and content of this paper. We thank the developers of CARNA for providing this robust and userfriendly tool on the web. We also thank Drs. Usman Roshan and Kaizhong Zhang for useful conversations. This research was supported in part by the National Science Foundation under Grant No. IIS0707571 and by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authorsâ€™ contributions
BS and JW conceived the study. YS and JW designed the algorithms. LH and YS implemented the algorithms and performed the experiments. JW, YS and BS wrote the manuscript. All authors read and approved the final manuscript.
Additional files
Additional file 1:
Supplementary information for â€˜Effective alignment of RNA pseudoknot structures using partition function posterior logodds scoresâ€™. Table S1. The RNA pseudoknot structures selected from the PDB and RNA STRAND to perform the alignment quality experiments. Table S2. The RNA pseudoknot structures selected from PseudoBase to perform the alignment quality experiments. Table S3. The RNA pseudoknotfree structures selected from Rfam and RNA STRAND to perform the alignment quality experiments.
Additional file 2:
The 106 pairwise alignments produced by RKalign on Dataset1.
Additional file 3:
The 124 pairwise alignments produced by RKalign on Dataset2.
Additional file 4:
The 30 multiple alignments produced by RKalign.
Rights and permissions
Open Access Â This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Song, Y., Hua, L., Shapiro, B.A. et al. Effective alignment of RNA pseudoknot structures using partition function posterior logodds scores. BMC Bioinformatics 16, 39 (2015). https://doi.org/10.1186/s1285901504649
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901504649