Efficient alignment of RNA secondary structures using sparse dynamic programming

Background Current advances of the next-generation sequencing technology have revealed a large number of un-annotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assess their biological functionalities. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is in urgent need to facilitate the comparative study. Despite the importance of the RNA secondary structure alignment problem, there are no computational tools available that provide high computational efficiency and accuracy. In this case, designing and implementing such an efficient and accurate RNA secondary structure alignment algorithm is highly desirable. Results In this work, through incorporating the sparse dynamic programming technique, we implemented an algorithm that has an O(n3) expected time complexity, where n is the average number of base pairs in the RNA structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate that ERA can significantly speedup RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy. Conclusions Using the sparse dynamic programming technique, we are able to develop a new RNA secondary structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf.edu/ERA.

: Schematic illustration of the technique used to achieve square space complexity. The shaded areas indicate the dynamic programming matrix that have already been filled. The blue shaded area indicates the scores on the edge that are stored, whose size is bounded by O(l). The yellow and green areas indicate the sequence alignment performed, which is resumed from the scores in the blue shaded area. The green area also indicates the scores to be stored for future sequence alignments.
Here we present an implementation technique that makes the ERA algorithm computable using only (O(max(zl, n 2 ))) space. Note that in the original algorithm, we need to store two twodimensional matrices for each pair of p A and p B for constant-time lookup of loop similarities. This indicates an O(n 2 l 2 ) space complexity of the original algorithm. Although the application of OPM can reduce the space complexity to O(zl 2 ) (we only store the two-dimensional matrices for OPMs instead of all pairs of p A and p B ), the space complexity is still too high for long RNA structures. Therefore, we propose a more efficient space handling technique to resolve this issue.
The key idea of our space handling technique is based on the observation that the computation of each optimal sequence alignment score depends on only three other scores. If we refer to S[i, j] as the optimal sequence alignment score between regions A[0...i] and B[0...j], then the computation of S[i, j] only depends on S[i − 1, j], S[i, j − 1], and S[i − 1, j − 1]. In this case, we do not need to store the entire two-dimensional matrix, but only the scores on the 'edge' of the sequence alignment matrix are sufficient. The sequence alignments for loop regions do not need to be pre-computed, but are resumed on demand using the recorded 'edge' scores. Since the number of scores on the edge is bounded by O(l), the space for storing loop similarities can be reduced to O(zl). Also note that because all sequence alignments are resumed from existing stages, no re-computation is possible. Therefore, applying this technique will not increase the time complexity.
A schematic illustration of this idea is presented in Figure S1. i,i (a multi-loop case). Once the loop similarity is required, we perform the sequence alignment on the corresponding regions (see detailed index in Figure S1 for enclosing or juxtaposing situations). The computed alignment results are indicated as the shaded regions. As soon as the demanded sequence similarity is returned, results stored in the shaded regions will not be requested in the future, since we do not allow crossing in RNA structures. In this case, we can discard the entire shaded regions except the scores on the edge (the blue shaded region in Figure S1). Clearly, storing the scores in the blue shaded region requires only O(l) space.
As the DP proceeds, the loop similarity associated o A,B i,i may be requested by other base pairs or OPMs. Let the base pairs be p A * and p B * (enclosing) or the OPM be o A,B * , * (juxtaposing) (see Figure S1). To compute the corresponding sequence similarity, we can utilize the previously stored edge scores to resume the alignment. In other words, the yellow and green regions can be computed if the scores in the blue shaded region are given. After returning the requested loop similarity, the blue and yellow regions can be discarded; and the green region is stored for further references. In this case, the space for recording all loop similarities can be reduced to O(zl), and the overall space complexity of this algorithm becomes O(max(zl, n 2 )) (where n 2 is used to store matrix M and M c ).
We show the space complexity of ERA using RNA structures with different lengths selected from Rfam in Figure S2. The O(l 2 ) space complexity is shown clearly. We also summarize the peak memory consumptions of ERA in Table S1.

S2 Selected RNA structures from Rfam
The RNA families and individual structures that are selected from the Rfam database are listed as the follows.

S4 Pruning base-pairing probability matrix for running LocARNA
In this section, we shown an original base-pairing probability matrix of a tmRNA, and a pruned base-pairing probability matrix of the same tmRNA in Figure S3. Figure S3: (a) The base-pairing probabilities of a tmRNA, which is used as LocARNA input when a number of structures from the ensemble are to be considered (when the minimum free energy structure is predicted with less confidence). (b) To benchmark ERA and LocARNA, we artificially remove base pairs that are not annotated. The remaining annotated base pairs are assigned with base pair probability of 1. In this case, we are able to input fixed RNA structures to LocARNA, and the comparison between ERA and LocARNA is fair.