Efficient alignment of RNA secondary structures using sparse dynamic programming
© Zhong and Zhang; licensee BioMed Central Ltd. 2013
Received: 17 January 2013
Accepted: 3 September 2013
Published: 8 September 2013
Current advances of the next-generation sequencing technology have revealed a large number of un-annotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assess their biological functionalities. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is in urgent need to facilitate the comparative study. Despite the importance of the RNA secondary structure alignment problem, there are no computational tools available that provide high computational efficiency and accuracy. In this case, designing and implementing such an efficient and accurate RNA secondary structure alignment algorithm is highly desirable.
In this work, through incorporating the sparse dynamic programming technique, we implemented an algorithm that has an O(n3) expected time complexity, where n is the average number of base pairs in the RNA structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate that ERA can significantly speedup RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy.
Using the sparse dynamic programming technique, we are able to develop a new RNA secondary structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf.edu/ERA.
Non-coding RNAs (ncRNAs) have recently been recognized as important regulators of the biological systems [1, 2]. They participate in the control of alternative splicing , gene transcription  and translation , and mRNA localization . Most of the ncRNAs exert their biological functions by folding into specific structures, which makes the study of the RNA structurome a critical step towards complete understanding of the operational mechanism of the biological system . Recently, genome-wide RNA structurome analysis has led to many interesting discoveries regarding novel regulatory mechanisms. For example, analysis of the RNA structural elements in Drosophila melanogaster 3’-UTR suggests a cluster of ncRNA elements that can direct the localization of their upstream genes within the spermatids . Similar studies have also been applied to the Ciona intestinalis genome for novel ncRNA family discovery . With the finishing of the ENCODE  and modENCODE  projects, we expect that much more RNA transcripts will be experimentally identified. Many of these RNA transcripts may have exceptionally large sizes , and calls for more efficient computational tools to analyze their structures.
As more RNA transcripts are being discovered, the experimental approaches for probing ncRNA structures are also being revolutionized, allowing more accurate functional investigation through exploiting the structure-function relationship. Traditional RNA three-dimensional (3D) structure determination techniques such as X-ray crystallography, NMR and cryo-EM are expensive, making them inappropriate for genome-wide survey of RNA structures. Currently, the emerging massive parallel sequencing technology has been incorporated into the traditional chemical probing methods, making genome-wide experimental determination of RNA secondary structures possible and with low cost. Available techniques in this category include PARS , FragSeq , and SHAPE-seq . The RNA secondary structures determined by these techniques are much more accurate than those predicted by pure computational methods. For example, when coupled with SHAPE-seq data, the free energy minimization approach  is able to predict the secondary structure of a 16S rRNA with over 95% accuracy . In this case, the major purpose of this work is to develop an efficient and accurate RNA secondary structure alignment algorithm to facilitate genome-wide comparative studies of these RNA secondary structures.
There are many existing algorithms that focus on the RNA secondary structure alignment problem [18-24]. RNA secondary structures can be represented as tree structures, and the edit-distance between the tree structures can be used to represent their structural similarity . Algorithms using such strategy are usually called tree editing algorithms. Using heavy path decomposition, Klein  improved the time complexity of the tree editing algorithm to O(l3logl). Recently, Demaine et al. further improved the time complexity to O(l3) based on Klein’s algorithm. However, Jiang et al. proposed to compute tree alignment distance for the comparison of trees. Algorithms that compute such a measure are called tree alignment algorithms. The tree alignment algorithm is a special case of the tree editing algorithm . The tree alignment algorithm has been implemented into an RNA secondary structure alignment tool called RNAforester. Both the tree editing and tree alignment algorithms rely on tree representation of the RNA structure, and make sophisticated scoring functions difficult to implement (such as the affine gap penalty for the loop regions). In addition, both tree editing and tree alignment algorithms do not treat base pairs as units of comparison, and make it difficult to implement a complete set of base-pair edit operations for RNA secondary structures editing (base-pair match, mismatch, breaking, altering, and removing; as defined by Jiang et al.). We demonstrate such a problem by showing a real example from the implementation of the widely-used RNA secondary structure alignment tool RNAforester.
The above example clearly shows the limitation of the implementation of the tree-based RNA secondary structure alignment algorithm RNAforester. Implementing the complete set of base-pair edit operations under the tree representation appears to be not a trivial task. Therefore, we propose to implement the general edit-distance alignment approach where all edit operations can be implemented naturally. To guarantee that the implementation is as efficient as the Demaine et al.’s algorithm (O(l3)), we incorporate the sparse DP technique into a simultaneous alignment and folding (SAF) algorithm RNAscf and restrict its input to fixed RNA secondary structures (recall that the general edit-distance alignment algorithm is a restricted case of the SAF algorithm). Using this technique, we can reduce the original time complexity by reducing a factor from n2 to z, where n is the number of base pairs in the fixed RNA structures and n<z<n2. Under the assumption of the polymer-zeta property of RNA molecules , it is expected that z≪n2 and even z∈O(n). In this case, the new general edit-distance RNA structure-structure alignment algorithm will have a time complexity of O(z n2+z l2). The new time complexity has an expected cubic (z=O(n)=O(l)) growth behavior, and is the same as Demaime et al.’s algorithm . In addition, we also devise a novel online pruning technique to further speedup the new algorithm, which deletes obsolete candidates on-the-fly. By combining both speedup techniques, the new RNA structure alignment algorithm is capable of comparing RNA secondary structures efficiently and accurately.
We have implemented the proposed RNA structure alignment algorithm into a program called ERA (Efficient RNA Alignment). The benchmark results showed that ERA has the expected O(z l2) time complexity. We showed the O(z l2) time complexity of ERA through aligning Rfam  RNA structures that were carefully chosen to represent a wide rage of input sizes. We also used a BraliBase II  benchmark to compare tools ERA, LocARNA and RNAforester when aligning RNAs with known structures. Nearly identical alignment quality can be observed for the general edit-distance alignment tools ERA and LocARNA, while both of them are more accurate than the tree alignment algorithm RNAforester. Finally, we also concluded that ERA is efficiently implemented by observing an average of 10 fold speedup over LocARNA, and RNAforester in terms of real RNA structure alignments. Based on these results, we confirmed that the sparse DP technique and the online pruning technique are successfully incorporated into the original RNAscf algorithm. We also anticipate that ERA will become an important bioinformatics tool for comparative RNA structure analysis.
In this section, we will present a novel general edit-distance RNA structure alignment algorithm by incorporating the sparse DP technique into the RNAscf algorithm. RNAscf was originally designed to identify the consensus structure between two RNA sequences. It guides the DP process though stacks and has a time complexity of O(n4+n2l2). Comparing to LocARNA (which has a time complexity of O(l4+n2l2)), the indexing scheme used by RNAscf makes it easier to incorporate the sparse DP technique, which aims to reduce the size of n instead of l. In addition to the sparse DP technique, we will also present an online pruning technique, which tries to reduce the search space of the algorithm as the DP proceeds. Through combining these two speedup techniques, the novel algorithm will have an expected O(z l2) time complexity, where n<z≪n2.
The Methods section is organized as follows: In Section ‘Preliminaries and definitions’, we will give the basic definition of RNA structures and the RNA alignment problem. In Section ‘The original O(n4+n2l2) algorithm’, we will reintroduce the RNAscf algorithm as a basis to understand the novel algorithm that is developed in this work. In Section ‘Triangular inequality and optimal pair matchings’, we will present the triangular inequality in RNA alignment with necessary proofs, which serves as a theoretical foundation for the sparse DP technique. In Section ‘Detection of optimal pair matchings’, we will further discuss the implementation details of incorporating the sparse DP technique. In Section ‘A new algorithm with cubic time complexity’, we will present the novel RNA alignment algorithm with the incorporation of the sparse DP technique. In Section ‘Online pruning of optimal pair matchings’, we will present the online pruning technique as an additional speedup step to the novel algorithm. Finally, in Section ‘Pseudo-code’, we will summarize the new algorithm using pseudo-code that can be directly implemented.
Preliminaries and definitions
We will begin with the introduction of the basic symbols and notations. The secondary structure of an RNA A of length l A is represented by a set of base pairs in A, denoted as . A base pair is an interaction formed between two nucleotides in the sequence of A, whose positions are denoted by l(p A ) and r(p A ) (without loss of generality, we assume l(p A )<r(p A )). The base pair p A can also be represented as (l(p A ),r(p A )). The base pairs are partially ordered by the increasing order of their ending nucleotides, i.e. if and only if . Since we do not consider RNA ensembles, no crossing base pair is allowed. That is, we do not allow . The two base pairs and are either enclosing or juxtaposing to each other. The base pair encloses if , denoted as . The base pair juxtaposes to and before if , and is denoted by .
We also define loop regions (i.e. hairpin loop, internal/bulge loop, and multi-branch loop) whose sequence similarities are assessed by the alignment. The loop regions can be viewed as the unpaired regions in the RNA sequence that are segregated by the paired nucleotides. Let A[ i…j] denote a continuous sequence region in RNA A, which begins with the ith nucleotide and ends with the jth nucleotide. Define L(p A ) as the sequence A[ l(p A )+1…r(p A )-1] (hairpin loop). If , define as the sequence , and as the sequence (internal or bulge loop). If , define as the sequence (multi-branch loop).
Here, the first term is the summation of all structural similarities (S s t r ) between the annotated base pairs. The structural similarity score for base-pair substitution is set using the RIBOSUM matrix , denoting such base-pair substitution matrix as R. We do not give penalty for base-pair deletion or insertion, as we may expect incorrectly predicted base pairs in the input RNA structures. The second term is the summation of the sequence similarities (S s e q ) on all loop (unpaired) regions that are determined by base-pair matchings in. The sequence similarity between two sequence regions is computed as traditional sequence alignment, with D as a 4-by-4 matrix that accounts for nucleotide substitution (set using the RIBOSUM matrix), g as the gap opening penalty, and e as the gap extension penalty  (g and e are both set to negative values and g<e). The weights w1 and w2 are used to balance the structural and sequence contribution to the overall alignment score, and we set w1>w2 to emphasize structural similarity. To simplify the expressions, in the rest of this article, we assume that w1 has been multiplied to all structural similarity terms (R), and w2 has been multiplied to all sequence similarity terms (D, g, and e).
We will now define the matrices that are used by the DP algorithm. Denote M[ p A ,p B ] as the optimal structure alignment score between the regions enclosed by p A and p B , given that p A is matched with p B . Denote M h [ p A ,p B ] as the optimal alignment score when the matching of p A and p B corresponds to a hairpin loop in the consensus structure. Similarly, M l [ p A ,p B ] stores the optimal alignment score when the matching of p A and p B corresponds to an internal, a bulge, or a multi loop in the consensus structure. Assume that , and , M l [ p A ,p B ] can be computed by referring to the matrix , which stores the optimal alignment score between the juxtaposed base-pair chains (each chain contains at least one base pair) that end with and , respectively. The optimal alignment between A and B can be retrieved from , where and are pseudo base pairs such that , , and .
The original O(n4+n2l2) algorithm
In these recursive functions, S s t r denotes the structural similarity between two base pairs p A and p B , S s e q denotes the sequence similarity between two unpaired regions, and G indicates the gap penalty for completely deleting the corresponding unpaired region. Note that G(|L|)=g+|L|∗e if |L|>0, and G(|L|)=0 otherwise. The base pair set contains all base pairs that are directly before and juxtaposed to . In other words, if , then there is no such base pair , such that . In most real scenarios, is considered as a constant [29, 35]. This chaining technique based on the set enables us to handle the multi-loop case efficiently, by only considering cases when computing M c .
Recall that the input RNA sequences have an average length of l and form an average of n base pairs. This algorithm can be computed with an expected time complexity of O(n4+n2l2). To see the time complexity, first note that all sequence similarity scores that are referred in the recursive functions can be computed within O(n2l2) time. Because all loop regions are segregated by base pairs, the number of loop regions is clearly bounded by O(n). Therefore, there are O(n2) combinations of loop matchings, and computing each matching requires O(l2) time using a standard sequence alignment algorithm . To this point, we assume all sequence similarities are computed using O(n2l2) time, and are stored in a matrix for constant-time lookup. Now, observe that this algorithm computes the optimal alignment by filling up the DP table M, which contains O(n2) values. Computing each value in the matrix M depends on the corresponding values of M h , M l , and M c . The computation of values in matrix M h can be finished in a constant time due to the pre-computed sequence similarities. The computation of M l requires O(n2) time, as determined by the necessity of traversing all possible combinations i and i′ (see Equation 4). Finally, M c can also be expected to be computed in a constant time, as is assumed to be a constant. In this case, the computation of matrix M requires O(n4) time. Adding up the time required to pre-compute all sequence similarities of the loops, the overall time complexity for this algorithm thus becomes O(n4+n2l2).
Triangular inequality and optimal pair matchings
The triangular inequality property servers as the theoretical foundation for the sparse DP technique, which saves search space while maintaining the global optimality. For computational RNA studies, this technique has been used in RNA folding , RNA consensus folding (SAF) [36, 37], as well as RNA-RNA interaction prediction  applications. In this work, our aim is to bring this technique into the RNA structure alignment application, where fixed RNA structures are considered instead of RNA structure ensembles.
To simplify the expression of the triangular inequality property, we define a number of pseudo base pairs to indicate specific regions of interest. A pseudo base pair is a void interaction, such that the structural similarity between any two pseudo base pairs is defined to be 0. For instance, let p and p′ be two arbitrary pseudo base pairs, we will have S s t r (p,p′)=0. The pseudo base pairs are only used for the sake of representational simplicity, and are not required for the implementation of the algorithm. Define a pseudo base pair p A =(i,j) and a pseudo base pair p B =(i′,j′). In this case, the optimal alignment score between the regions A[ i…j] and B[ i′…j′], i.e. M[ i,j;i′,j′], can be rewritten as M[ p A ,p B ]. Similarly, define pseudo base pairs , , , and (see Figure 2 (a)). The triangular inequality can be simplified using the following observation:
Using Observation 1, we can detect potential redundant computations in the original algorithm. Consider the structural configurations shown in Figure 2 (b), and assume that the base pairs p A and p B are being aligned at the current stage. Let and be arbitrary base pairs such that . Note that may also represent a pseudo base pair in order to consider an arbitrary subregion enclosed by p A . Define pseudo base pairs , , , , , and . Pseudo base pairs are also added to B symmetrically (see Figure 2 (b)). We can then prove Lemma 1 using Observation 1:
If and , such that , then .
The first inequality is a direct application of Observation 1, and the second inequality is specified in the condition of Lemma 1.
Because and are arbitrary base pairs, Lemma 1 implies that the matching between p A and p B is guaranteed to be suboptimal. That is, the overall alignment score, given that p A matches with p B , is always lower than that when assuming they do not match (as the matching of p A and p B is conflicted with the matching of and , as well as the matching of and ). In this case, we can devise the DP algorithm to bypass the redundant references to the scenarios where p A matches p B . Conversely, for the implementation of this idea, the DP algorithm will refer to the scenarios of matching p A and p B only when the condition specified in Lemma 1 is NOT satisfied. These necessary base-pair matchings are called the Optimal Pair Matchings (OPMs). If the matching of p A and p B is an OPM, we denote this OPM as oA,B. Similarly, we represent the OPM formed by base pairs and as . The new RNA alignment algorithm will maintain an OPM list , which is modified online as the DP proceeds, so as to include newly identified OPMs and remove obsolete OPMs (which will be discussed in Section ‘Online pruning of optimal pair matchings’). If we assume that the RNA molecules have the polymer-zeta property , restricting the search space of the DP using the OPM list will reduce the time complexity of the RNA alignment algorithm to O(z l2) (as will be discussed in Section ‘A new algorithm with cubic time complexity’).
Detection of optimal pair matchings
In the previous section, we have proved that Lemma 1 can be used to detect the OPMs and save redundant computations. In this section, we will briefly discuss how it will be implemented. Lemma 1 states that if the alignment score assuming p A matches p B (M[ p A ,p B ]) is higher than the alignment score assuming p A does not match p B , the matching between p A and p B is an OPM. Therefore, to detect the OPMs, we need to compute two alignment scores, i.e. the one when assuming p A matches p B and the one when assuming p A does not match p B .
Based on previous definition, the first alignment score is computed as M[ p A ,p B ]. In this case, we only need to compute the second alignment score. However, computing the second alignment score (assuming p A does not match p B ) is difficult. Instead, we can compute the overall alignment score without assuming any restrictions. Apparently, the overall alignment score includes both cases disregarding whether p A matches with p B . Therefore, if M[ p A ,p B ] is greater than or equal to such an overall optimal alignment, it is guaranteed to be greater than the alignment score when assuming p A does not match p B , and ipso facto the matching of p A and p B is an OPM.
Recall that the alignment score M[ p A ,p B ] corresponds to the case where p A matches with p B , and therefore it can be decomposed as the sum of two parts: the structure similarity between the two base pairs themselves S s t r (p A ,p B ), and the optimal alignment score between the regions A[ l(p A )+1…r(p A )-1] and B[ l(p B )+1…r(p B )-1] without any restrictions. In this case, define two pseudo base pairs and , then can also be decomposed as the sum of two parts: , and the optimal alignment score between the regions A[ l(p A )…r(p A )] and B[ l(p B )…r(p B )] without any restrictions. Note that and are both pseudo base pairs, and thus based on the definition, we have . Therefore, is exactly the overall alignment score we need to detect the OPMs.
In this case, based on Lemma 1, if , we will consider the matching of p A and p B as an OPM, and add the OPM oA,B to the OPM list . The overhead for detecting the OPM is that we need to double the computation for each combination of p A and p B . However, such overhead will not raise the time complexity, and it is worthy as it will lead to a more significant speedup of the algorithm. In the following section, we will devise a new algorithm by assuming that the OPM list is available.
A new algorithm with cubic time complexity
In this section, we introduce a new general edit-distance RNA structure alignment algorithm, which improves the original RNAscf algorithm based on Lemma 1 and has a time complexity of O(z(n2+l2)). Here, z is the size of the OPM list , and we expect that z∈O(n) when assuming polymer-zeta property . If we also assume O(n)=O(l) (with fixed input RNA structures or efficiently pruned RNA structure ensembles), the overall time complexity of the new algorithm becomes O(z l2).
Here, the set contains all OPMs that are directly before the OPM . The set regarding the OPMs is defined as the follows. If an OPM , then either or .
Recall that the time complexity of the original algorithm is O(n4+n2l2). The first term O(n4) results from O(n2) computations by traversing all combinations of p A and p B (see Equation 2) and O(n2) time for computing M l (see Equation 4). In the new algorithm, we introduce the OPM constraint to Equation 8 and Equation 9, and thus reduce the time complexity for computing M l from O(n2) to O(z). In this case, the first term O(n4) of the original time complexity can be reduced to O(z n2).
The second term O(n2l2) in the original time complexity results from computing the sequence similarities between all loop regions. Note that all loop similarities required for computing M l (Equation 8) and M c (Equation 9) are associated with OPMs. For example, in Equation 8, all the loops are defined according to and , whose matching is expected to be an OPM. And in Equation 9, all the loops are defined according to and , as well as and , where both of these matchings are assumed to be OPMs. In this case, we do not need to compute loop similarities for all O(n2) base-pair combinations, instead we only need to compute the loop similarities that are associated with the OPMs. In this case, the time complexity for computing the sequence similarities between all loops that are required by the computation of M l and M c can be finished in O(z l2) time.
where d m a x is the highest score in the 4-by-4 nucleotide substitution matrix D, and I is a boolean variable that is set to 1 if |L(p A )|≠|L(p B )| and set to 0 otherwise. For the computation of each M[ p A ,p B ], we first estimate the upper bound in a unit time, and then compute M l [ p A ,p B ] in O(z) time. By comparing these two values, we will determine whether the computation of M h [ p A ,p B ] is necessary. The computation of M h [ p A ,p B ] is only necessary when there are only a few base pair enclosed by p A and p B to be matched. Such condition implies the scenarios that either p A or p B is a real hairpin loop in the RNA structures, whose number is bounded by O(n). Overall, the hairpin loop similarity matrix M h can be computed in O(n l2) time, and the overall time complexity of this algorithm is thus O(z(n2+l2)).
Online pruning of optimal pair matchings
In the previous sections, we have presented our approaches for detecting OPMs and building an OPM list , as well as a more efficient algorithm that is developed based on . Time complexity analysis of the algorithm claims that O(z(n2+l2)) time is sufficient for this new algorithm. The size of the OPM list , i.e. z, thus becomes an important factor that determines the efficiency of the novel algorithm. Under the current algorithmic setup, as well as other similar works that implement a candidate list [30, 37], z continuously grows as the algorithm proceeds. In this case, it is desirable to devise an online pruning technique, which can remove the obsolete OPMs from , and thus achieve further speedup of the algorithm.
In this section, we will present such an online pruning technique to reduce the size of the OPM list . The intuition of this online pruning technique comes from the following observation. The RNA structures are primarily stabilized by a number of helices, or perfectly stacked base pairs. If is perfectly stacked on , then , and . Consider the alignment between two helices, where each one of them contains m+1 perfectly stacked base pairs. Assume that the first helix contains base pairs , and the second helix contains base pairs . Based on Lemma 1, there will be at least m OPMs detected from such alignment, i.e. . Apparently, maintaining all these m OPMs is unnecessary, as these base pairs should be aligned together as two complete helices, rather than be aligned separately as two sets of individual base pairs. In this case, maintaining only one OPM, i.e. , is sufficient to represent such an alignment. The other m OPMs become obsolete as soon as the OPM is detected, and can be removed from the OPM list to improve computational efficiency. In the following paragraphs, we will extend this idea to consider all situations in addition to the perfectly stacked scenario, as well as give formal description of this technique and related proofs.
We will demonstrate the major idea of our novel online OPM pruning technique using Figure 2 (b). Imagine that at the current stage, M[ p A ,p B ] has just been computed and oA,B has been identified as an OPM, where is an arbitrary OPM that has been previously identified and is enclosed by oA,B ( and ). Our aim is to estimate whether the detection of the OPM oA,B will make obsolete. Let and be arbitrary base pairs such that and . The regions enclosed by and can be partitioned using at least one of the following ways: (which is indicated by dark gray in Figure 2 (b)) and (which is indicated by light gray in Figure 2 (b)). If the corresponding score for the first path is higher than the second, will not be referred to by any future matching between arbitrary base pairs and , and thus making the OPM obsolete. In this case, the OPM can be removed from .
In this case, if , we are sure that the criterion for characterizing as an obsolete OPM will be satisfied, and we will be able to remove from immediately.
Now, we can discuss the details for setting up the upper bounds and . Because and are defined symmetrically, we only discuss the computation of . Note that the upper bound needs to satisfy the condition . Clearly, the difference between directly comes from concatenating the region to the region , as well as concatenating the region to the region . The best case scenario for such an operation, is to assume that the concatenation of the regions and will result in as many new base-pair and nucleotide matches as possible.
With the upper bounds and , we are able to formally prove the correctness of the online OPM pruning technique:
If , where and , then .
As a result, when the condition given in Lemma 2 is satisfied, the enclosed OPM can be readily removed.
We implemented the proposed general edit-distance RNA structural alignment algorithm into a program called ERA (Efficient RNA Alignment) using GNU C++. In this section, we will show that (1) ERA has the expected O(z l2) time complexity; (2) ERA is as accurate as the other state-of-the-art RNA alignment tools; and (3) ERA runs much faster than the other RNA alignment tools. In addition to these goals, we have also benchmarked ERA to demonstrate its O(l2) space complexity. For details regarding the space complexity issues please refer to the Additional file 1: Section S1 (also see Figure S1, Figure S2, and Table S1).
We benchmarked the ERA with two other state-of-the-art RNA alignment tools: LocARNA as a representative of the general edit-distance RNA structure alignment algorithms and RNAforester as a representative of the tree-based RNA structure alignment algorithms. Note that although LocARNA is developed to compare RNA structure ensembles, its flexible parameter setup makes it easy to prune its input RNA ensembles (see Section ‘Running LocARNA’ for more details). However, the readers should note that LocARNA is used in a restricted case for fair comparison with ERA, and more potential applications of LocARNA should be recognized. We do not compare ERA with its predecessor RNAscf, because RNAscf is implemented to find consensus helical configurations that do not include individual base pairs . Both LocARNA and RNAforester were invoked using their default parameters.
Note that LocARNA was originally developed to compare two RNA structure ensembles . Due to the recent technical advances in experimental RNA structure probing, we anticipate that RNA structures can be predicted with much higher accuracy. Therefore, we develop ERA to compare two fixed RNA structures. In this case, we need to prune the original inputs of LocARNA, so as to ensure that they only represent the fixed structures rather than any additional information.
The input RNA ensembles for LocARNA are represented using the base-pairing probability matrices, which can be computed using the McCaskill’s algorithm [40, 41]. In a base-pairing probability matrix, each base pair (possibly crossing) is assigned with a probability to indicate its thermodynamic stability. Our goal is to prune such a base-pair probability matrix, such that it only contains information regarding the fixed RNA structure (in our experiment, we take the Rfam  annotation or the BraliBase II  annotation as the fixed structure for an RNA sequence). For each base pair in the matrix, if it is not presented in the annotated structure, its corresponding probability is reset to 0. On the other hand, if it is included in the annotated structure, its probability is reset to 1. In this case, the pruned base-pairing probability matrix contains only the information regarding the fixed RNA structure. We show an original and a pruned base-pairing probability matrix in Additional file 1: Figure S3 as an example. All LocARNA inputs for experiments mentioned in this article are preprocessed using this strategy.
Running time speedup
Comparison on running time of ERA, LocARNA , and RNAforester
Discussion and conclusions
In this article, we have presented a novel algorithm for efficient alignment of RNA secondary structures by incorporating the sparse DP technique. The major theoretical contribution of this work lies in two parts. First, to our knowledge, this is the first application of the sparse DP technique to RNA structure-structure alignment. Second, the novel online OPM pruning technique can provide insights for future algorithm designs that need to maintain a candidate list. The implementation of this novel algorithm is a tool called ERA, which can run in O(z l2) time and O(l2). Such time and space complexity make ERA one of the most efficient RNA structure alignment tools that are currently available.
The online OPM pruning technique is newly developed from this work, which aims at deleting obsolete candidates as the DP proceeds. Although this technique cannot improve the computational complexity, it is efficient in reducing the real running time. In Additional file 1: Table S3, we summarized the running time of ERA in aligning individual RNA structures, with or without the online OPM pruning technique. We observed that by incorporating this technique, the running time of ERA was reduced by an average of 2.3 fold. Meanwhile, the speedup ratio is highly uniform (with 1.7 fold as the lowest and 3.1 fold as the highest) across RNA structures with different sizes, meaning that it reduces running time by a constant factor. The online OPM pruning technique can also be modified and incorporated into other related algorithms that implement the candidate list, such as the sparse DP algorithms for RNA folding , RNA consensus folding [36, 37], and RNA-RNA interaction .
The speedup of ERA is most significant when the number of base pairs in the RNA structures is small. This is because the algorithm is indexed by base pairs and has a time complexity of O(z(n2+l2)). As n increases, the term O(z n2) will dominate the overall time complexity. In this case, an ideal application of ERA is to align fixed RNA structures, because it guarantees that n<l. Note that as a sparsified version of the SAF algorithm RNAscf, the new algorithm developed here is also capable of handling RNA structure ensemble alignments. However, we do not implement this feature into ERA, because one cannot guarantee n<l for RNA ensemble alignments. This would make the speedup of ERA less significant. Besides, there are other alternative tools [36, 37] available for such a purpose.
With the completion of the ENCODE  and modENCODE  projects, more and more RNA transcripts will be experimentally revealed. At the same time, with the advance of high-throughput RNA structure probing techniques [13-15], the secondary structures of these RNA transcripts will also be predicted with a much higher accuracy. In this case, ERA, which can compare fixed RNA structure efficiently and accurately, becomes an ideal computational tool to evaluate the structural similarities of these RNA transcripts. ERA can be used to perform all-against-all alignments on these RNA transcripts, which will then be subsequently summarized as the distance matrix for clustering purposes. Various clustering algorithms [8, 39] can then be applied to identify ncRNA families with similar secondary structures and infer their amazing cellular and molecular functionalities.
This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health, USA (R01GM102515).
- Eddy S: Non-coding RNA genes and the modern RNA world. Nat Rev Genet. 2001, 2: 919-929. 10.1038/35103511.View ArticlePubMedGoogle Scholar
- Storz G: An expanding universe of noncoding RNAs. Science. 2002, 296: 1260-1263. 10.1126/science.1072249.View ArticlePubMedGoogle Scholar
- Tripathi V, Ellis JD, Shen Z, Song DY, Pan Q, Watt AT, Freier SM, Bennett CF, Sharma A, Bubulya PA, Blencowe BJ, Prasanth SG, Prasanth KV: The nuclear-retained noncoding RNA MALAT1 regulates alternative splicing by modulating SR splicing factor phosphorylation. Mol Cell. 2010, 39: 925-938. 10.1016/j.molcel.2010.08.011.PubMed CentralView ArticlePubMedGoogle Scholar
- Tucker BJ, Breaker RR: Riboswitches as versatile gene control elements. Curr Opin Struct Biol. 2005, 15: 342-348. 10.1016/j.sbi.2005.05.003.View ArticlePubMedGoogle Scholar
- Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell. 2009, 136: 215-233. 10.1016/j.cell.2009.01.002.PubMed CentralView ArticlePubMedGoogle Scholar
- Crucs S, Chatterjee S, Gavis ER: Overlapping but distinct RNA elements control repression and activation of nanos translation. Mol Cell. 2000, 5: 457-467. 10.1016/S1097-2765(00)80440-2.View ArticlePubMedGoogle Scholar
- Wan Y, Kertesz M, Spitale RC, Segal E, Chang HY: Understanding the transcriptome through RNA structure. Nat Rev Genet. 2011, 12: 641-655. 10.1038/nrg3049.View ArticlePubMedGoogle Scholar
- Zhong C, Andrews J, Zhang S: Discovering non-coding RNA elements in Drosophila 3’ un-translated regions. Proceedings of the 2nd IEEE International Conference of Computational Advances in Bio and Medical Sciences. 2012, IEEE, 1-6.Google Scholar
- Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007, 3: e65-10.1371/journal.pcbi.0030065.PubMed CentralView ArticlePubMedGoogle Scholar
- Bernstein BE, Birney E, Dunham I: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.View ArticleGoogle Scholar
- Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F, Snyder M, Stein L, White KP, Waterston RH: Unlocking the secrets of the genome. Nature. 2009, 459: 927-930. 10.1038/459927a.PubMed CentralView ArticlePubMedGoogle Scholar
- Mercer TR, Dinger ME, Mattick JS: Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009, 10: 155-159. 10.1038/nrg2521.View ArticlePubMedGoogle Scholar
- Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, Chang HY, Segal E: Genome-wide measurement of RNA secondary structure in yeast. Nature. 2010, 467: 103-107. 10.1038/nature09322.View ArticlePubMedGoogle Scholar
- Underwood JG, Uzilov AV, Katzman S, Onodera CS, Mainzer JE, Mathews DH, Lowe TM, Salama SR, Haussler D: FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat Methods. 2010, 7: 995-1001. 10.1038/nmeth.1529.PubMed CentralView ArticlePubMedGoogle Scholar
- Lucks JB, Mortimer SA, Trapnell C, Luo S, Aviran S, Schroth GP, Pachter L, Doudna JA, Arkin AP: Multiplexed RNA structure characterization with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci USA. 2011, 108: 11063-11068. 10.1073/pnas.1106501108.PubMed CentralView ArticlePubMedGoogle Scholar
- Reuter JS, Mathews DH: RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 2010, 11: 129-10.1186/1471-2105-11-129.PubMed CentralView ArticlePubMedGoogle Scholar
- Deigan KE, Li TW, Mathews DH, Weeks KM: Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci USA. 2009, 106: 97-102. 10.1073/pnas.0806929106.PubMed CentralView ArticlePubMedGoogle Scholar
- Tai KC: The tree-to-tree correction problem. J ACM. 1979, 26: 422-433. 10.1145/322139.322143.View ArticleGoogle Scholar
- Zhang K, Shasha D: Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput. 1989, 18: 1245-1262. 10.1137/0218082.View ArticleGoogle Scholar
- Jiang T, Wang L, Zhang K: Alignment of trees - an alternative to tree edit. Theor Comput Sci. 1995, 143: 137-148.View ArticleGoogle Scholar
- Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures. Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference. 2003, Washington DC: IEEE Computer Society, 159-168.Google Scholar
- Chen S, Zhang K: An improved algorithm for tree edit distance with applications for RNA secondary structure comparison. J Comb Optim. 2012, 1-20.Google Scholar
- Bafna V, Muthukrishnan S, Ravi R: Computing similarity between RNA strings. Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching. 1995, Berlin Heidelberg: Springer-Verlag, 1-16.View ArticleGoogle Scholar
- Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures. J Comput Biol. 2002, 9: 371-388. 10.1089/10665270252935511.View ArticlePubMedGoogle Scholar
- Klein PN: Computing the edit-distance between unrooted ordered trees. Proceedings of the 6th Annual European Symposium on Algorithms. 1998, Berlin Heidelberg: Springer-Verlag, 91-102.Google Scholar
- Demaine ED, Mozes S, Rossman B, Weimann O: An optimal decomposition algorithm for tree edit distance. ACM Trans Algo. 2009, 6: 1-19.View ArticleGoogle Scholar
- Bille P: A survey on tree edit distance and related problems. Theor Comput Sci. 2005, 337: 217-239. 10.1016/j.tcs.2004.12.030.View ArticleGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4.View ArticlePubMedGoogle Scholar
- Bafna V, Tang H, Zhang S: Consensus folding of unaligned RNA sequences revisited. J Comput Biol. 2006, 13: 283-295. 10.1089/cmb.2006.13.283.View ArticlePubMedGoogle Scholar
- Wexler Y, Zilberstein C, Ziv-Ukelson M: A study of accessible motifs and RNA folding complexity. J Comput Biol. 2007, 14: 856-872. 10.1089/cmb.2007.R020.View ArticlePubMedGoogle Scholar
- Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database. Nucleic Acids Res. 2003, 31: 439-441. 10.1093/nar/gkg006.PubMed CentralView ArticlePubMedGoogle Scholar
- Gardner PP, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005, 33: 2433-2439. 10.1093/nar/gki541.PubMed CentralView ArticlePubMedGoogle Scholar
- Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 44-10.1186/1471-2105-4-44.PubMed CentralView ArticlePubMedGoogle Scholar
- Myers E, Miller W: Optimal alignment in linear space. Comput Appl Biosci. 1988, 4: 11-17.PubMedGoogle Scholar
- Zhong C, Tang H, Zhang S: RNAMotifScan: automatic identification of RNA structural motifs using secondary structural alignment. Nucleic Acids Res. 2010, 38: e176-10.1093/nar/gkq672.PubMed CentralView ArticlePubMedGoogle Scholar
- Ziv-Ukelson M, Gat-Viks I, Wexler Y, Shamir R: A faster algorithm for RNA co-folding. Proceedings of the 8th Workshop on Algorithms in Bioinformatics. 2008, Berlin Heidelberg: Springer-Verlag, 174-185.View ArticleGoogle Scholar
- Backofen R, Tsur D, Zakov S, Ziv-Ukelson M: Sparse RNA folding: time and space efficient algorithms. J of Discrete Algorithms. 2011, 9: 12-31. 10.1016/j.jda.2010.09.001.View ArticleGoogle Scholar
- Salari R, Mȯhl M, Will S, Sahinalp SC, Backofen R: Time and space efficient RNA-RNA interaction prediction via sparse folding. Proceedings of the 14th International Conference on Research in Computational Molecular Biology. 2010, Berlin Heidelberg: Springer-Verlag, 473-490.View ArticleGoogle Scholar
- Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007, 3: e65-10.1371/journal.pcbi.0030065.PubMed CentralView ArticlePubMedGoogle Scholar
- McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 1105-1119. 10.1002/bip.360290621.View ArticlePubMedGoogle Scholar
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163.View ArticleGoogle Scholar
- Washietl S, Hofacker I, Stadler P: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA. 2005, 102: 2454-2459. 10.1073/pnas.0409169102.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.