A method for rapid similarity analysis of RNA secondary structures

Background Owing to the rapid expansion of RNA structure databases in recent years, efficient methods for structure comparison are in demand for function prediction and evolutionary analysis. Usually, the similarity of RNA secondary structures is evaluated based on tree models and dynamic programming algorithms. We present here a new method for the similarity analysis of RNA secondary structures. Results Three sets of real data have been used as input for the example applications. Set I includes the structures from 5S rRNAs. Set II includes the secondary structures from RNase P and RNase MRP. Set III includes the structures from 16S rRNAs. Reasonable phylogenetic trees are derived for these three sets of data by using our method. Moreover, our program runs faster as compared to some existing ones. Conclusion The famous Lempel-Ziv algorithm can efficiently extract the information on repeated patterns encoded in RNA secondary structures and makes our method an alternative to analyze the similarity of RNA secondary structures. This method will also be useful to researchers who are interested in evolutionary analysis.


Background
RNA secondary structures play an important role in determining the functions of RNA molecules. Some of them have been accepted as good data for evolutionary analysis. With the completion of the sequencing of the genomes of human and other species, major structural biology resources have been harnessed to predict functions. More and more RNA structures are accumulated and we know little about their functions. This calls for the development of cost-effective computational methods to predict RNA functions, which will provide preliminary information for biologists and guide biological experiments. Earlier studies usually adopt dynamic programming algorithms and tree models. Shapiro et al [1] proposed to compare RNA secondary structures by using tree models. Hofacker et al [2] compared RNA secondary structures by aligning the corresponding base pairing probability matrices that were computed by McCaskill's partition function algorithm [3]. Because these methods rely on dynamic programming algorithms, they are compute-intensive. Constructing tree models is based on the idea that the stems or helices dominantly stabilize the secondary structures. So they ignore their primary sequences and focus on so-called elementary units (stem and loop, etc) for the similarity analysis. There are other works, in which tree models were constructed to analyze the similarity of RNA secondary struc-tures [4][5][6][7][8]. Recently Liao et al [9] have proposed to use graphs to represent RNA secondary structures and then derive some invariants from graphs to compare RNA secondary structures. This idea is from the study of DNA sequences [10][11][12][13]. It has been stated [10] that invariants actually reflect some characterizations of biological structures or sequences and may be regarded as indicators. Some information will be lost, however, and how to obtain and select suitable invariants to characterize biological sequences so as to compare DNA sequences effectively is still unsolved. What's more, the graphical representations don't work well when the size of the RNA secondary structure is large. Obviously, for complex RNA secondary structures, more information is lost, which will affect the similarity analysis. Popular tools for optimal alignment of RNA secondary structures include RNAdistance [1], RNAforester [14] etc. RNAdistance uses the tree models to coarsely represent RNA secondary structures, and compares RNA secondary structures based on tree edit distance measure. RNAforester supports the computation of pairwise and multiple alignment of structures based on tree alignment measure.
In this paper we propose a novel method for the similarity analysis of RNA secondary structures, where pseudoknots are also taken into account. In our approach, each secondary structure is transformed into a linear sequence. The linear sequence not only contains the information on the corresponding RNA primary structure, but also contains the information on the base pairing.
Furthermore, standard and famous Lempel-Ziv algorithm [15] is employed for the similarity analysis. Of course, we have tested the validity of our method by analyzing three sets of real data. The results obtained by our method are comparable to those given by other authoritative methods. What's more, the whole process is easy to operate. It can yield results rapidly.

Materials
Three sets of real data are used to test our method. RNA secondary structures in set II are from RNase P and RNase MRP. They are distantly related and there is little sequence homology between them. These secondary structures are used to test distant RNA secondary structures. They are mainly obtained from the RNase P Database [16] and the remaining secondary structures are obtained from [17].

The similarity analysis of set I and set II by using our method
The goal of our study is to compare RNA secondary structures and analyze their similarity. Given a set of RNA secondary structures, our method requires the following main operations for the similarity analysis: Firstly the non-linear complex RNA secondary structures are transformed into linear characteristic sequences. Secondly, these linear sequences are decomposed according to the rule of Lempel-Ziv algorithm to evaluate the LZ complexity. Thirdly, the similarity degree between any two structures is measured by our distance formula, as shown in Method section. Lastly, by arranging all the values into a matrix, we obtain a pair-wise distance matrix. It contains the information on the similarity of this set of RNA secondary structures. We have used our method to analyze the similarity of set I and set II, respectively. Its validity may be better reflected by its application to reconstruct phylogenetic trees. Hence, for the two sets of data, we input their pair-wise distance matrices, obtained by our methods, into the Neighbor program in the Phylip package [19], respectively. By choosing Neighbor-joining option, we obtain two phylogenetic trees for the two sets, which are drawn by Treeview program [20] and are shown in Figure 1 and Figure 2.

Discussion
Lempel-Ziv algorithm is an algorithm that is related to minimal length encoding. Its successful application to the evolutionary analysis of DNA sequences has indicated that Lempel-Ziv algorithm is an alternative to the similarity analysis of biological sequences. To our knowledge, the concept of applying Lempel-Ziv algorithm to the similarity analysis of RNA secondary structures hasn't been adopted by any other researcher. The introduction of our method in Method section indicates that this is a relatively simple and rapid method for the similarity analysis of RNA secondary structures. We owe the efficiency of this method mainly to the Lemple-Ziv algorithm, which can effectively extract the repeated patterns encoded in linear sequences.
For comparison, we employ RNAforester program to perform the similarity analysis on the same data. This program calculates the similarity score for any pair of RNA secondary structures under the proposed scoring scheme. The similarity relationship is displayed in a cluster tree. By performing the RNAforester program on set I and set II, we obtain two cluster trees, as shown in Figure 3 and Figure 4. The numbers in the interior nodes of the cluster trees usually represent the similarity scores between the two sub-clusters that the interior nodes connect, respectively. Note that we set 0.7 as the clustering threshold when we run RNAforester program. Thus the similarity score that is not less than 0.7 will be replaced by 0 in the cluster tree. The efficiency of RNAforester program in analyzing the data from set I and set II is evaluated by Figure  3 and Figure 4.
At first, we compare Figure 1 with Figure 3. From Figure 1, we observe that: 1. Actinia equina, Chrysaora quinque and Planocera recticulata are grouped closely (they belong to Animalia); 2. Basidiobolus magnus and Christiansenis pallida are grouped closely (they belong to fungi); 3. Pyrodictium occultum, Halobacterium spl and Sulfolobus spl (they belong to Archaebacteria) are clearly separated from the rest; 4. Dicyema misakiense is placed closer to Animalia than to fungi (it belongs to mesozoa). The relationship described by our method is in accordance with the one described in [21,22]. In contrast to Figure 1, we find in Figure 3, obtained by using RNAforester program, that Halobacterium spl is separated from the cluster that Pyrodictium occultum and Sulfolobus spl belong to. Obviously this is not reasonable. Then we compare Figure 2 with Figure 4. From Figure 2, we observe that our result is consistent with the theory that is suggested in [23][24][25][26]: MRP evolved from a Eukaryotic Nuclear P in the nucleus of an early Eukaryote. Figure 2 indicates that mrpRNA are more similar to eukaryotic pRNA than to prokaryotic pRNA. Furthermore, Synechocystis sp.PCC6803, Anacystis nidulans PCC6301, Pseudoanabaena sp.PCC6903, Anabaena sp.PCC7120 and Porphyra purpurea chloroplast are grouped closely, named cluster I for convenience; Thermotoga mar-Neighbor-joining tree for the data in set I Figure 1 Neighbor-joining tree for the data in set I. It is obtained by our method and drawn by Treeview program.
itima, Agrobacterium tumefaciens and Rhodospirillum rubrum are grouped closely, named cluster II. Cluster I and cluster II are adjacent. In Figure 4, Halobacterium cutirubrum is put far away from Methanococcus jannaschii. Furthermore, Anacystis nidulans P is separated far from Synechocystis sp.P and Anabaena sp.P. Bacillus subtilis and Reclinomonas americana mitochondria aren't placed closely. This conformation doesn't accord with the one demonstrated by Collins et al.
In general, our method can compare secondary structures reasonably, with the results consistent with those from [23][24][25][26]. For the two data sets, our algorithm performs better than RNAforester program. Additionally, our analysis results favor the proposal that RNA secondary structures are useful materials for evolutionary analysis.
It seems that our method is heavily biased towards comparing sequences, not secondary structures. However, in fact, this is not the truth. We now apply Lempel-Ziv algorithm directly to RNA sequences to see whether the result obtained by this method is better than ours. As a result, the phylogenetic tree for the data in set II has much divergence from ours, shown in Figure 5 (drawn by Treeview).
It's obvious that there exists unreasonable topology that depicts the similarity relationship of these RNA secondary structures in Figure 5. For example, Thermotoga maritima, Agrobacterium tumefaciens and Rhodospirillum rubrum are placed close to the RNase MRP RNAs and are separated far away from the branch for Synechocystis sp.P and Anabaena sp.P, etc, which simultaneously leads to the separation of Neighbor-joining tree for the data in set II Figure 2 Neighbor-joining tree for the data in set II. It is obtained by our method and drawn by Treeview program.
Cluster tree for the data in set I Figure 3 Cluster tree for the data in set I. It is obtained by using RNAforester program. The tree is derived based on the similarity scores between any pair of RNA forests.
Cluster tree for the data in set II Figure 4 Cluster tree for the data in set II. It is obtained by using RNAforester program. The tree is derived based on the similarity scores between any pair of RNA forests.

Sulfolobus acidocaldarius from Methanococcus jannaschii and
Halobacterium cutirubrum. In nature, Thermotoga maritima, Agrobacterium tumefaciens and Rhodospirillum rubrum belong to Eubacterial RNase P and should be grouped close to Synechocystis sp.P and Anabaena sp.P, etc. Figure 5 has favored our claim, i.e. our characteristic sequences do grasp some information on RNA secondary structures (base pairing).
The introduction of the Lempel-Ziv algorithm to the similarity analysis makes our algorithm run fast. Table 1 lists the general time and space complexity of our method and RNAforester program. In Table 1, the relationship between the size (length) of RNA secondary structure and the time complexities hasn't been indicated explicitly for the RNAforester program. We may make approximate estimation. In theory, the total number of the nodes of an RNA forest scales linearly with the size of the RNA secondary structure. For RNA secondary structures that exist in nature, the maximum length of an unpaired region and the branching degree can be considered to be bounded by some constants, which determines that the degree of an Neighbor-joining tree for the data in set II Figure 5 Neighbor-joining tree for the data in set II. It is obtained by performing LZ algorithm on RNA primary structures, i.e. the step to extract linear characteristic sequences from RNA secondary structures has been ignored.
RNA forest is expected to stay a constant. Hence the running time O(|F 1 ||F 2 |deg(F 1 ))deg(F 2 )) [27] is equivalent to O(n 1 n 2 ), where n 1 n 2 is the product of the sizes of the two RNA secondary structures being compared.
On the other hand, we have compared the execution time of our method with that of RNAforester by using some RNA secondary structures of various sizes. The results are listed in Table 2. It's obvious that our algorithm performs faster.
Additionally, we have performed our program on a set of 16S rRNAs, whose secondary structures are more complex and the sizes of which are relatively larger. The result is shown in Figure 6, drawn by Treeview. Their similarity relationship has been reasonably derived by our method. Thermoproteus tenax, Halobacterium, Methanococus vannielli, Thermococcus celer and Methanobacterium have been clustered together, which is consistent with the fact that they are of Archaea. Mus musculus, Saccharomyces cerevisiae, Homo sapiens, Vairimorpha are clustered together, which is consistent with the fact that they are of Eucaya. The left are of Bacteria.

Conclusion
Here we have proposed a new method to analyze the similarity of RNA secondary structures (pseudoknots are taken into account). It is a simple method that yields results reasonably and rapidly. Our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of RNA secondary structures. The new method doesn't require sequence alignment and the construction of tree models. It is based on linear characteristic sequences that we define for RNA secondary structures and the famous Lempel-Ziv algorithm that relates to minimal length encoding. The characteristic sequences contain the infor-mation from RNA primary structures and the base pairs formed in RNA secondary structures. The Lempel-Ziv algorithm effectively extracts the information on the repeated patterns encoded in long sequences. The example applications of our method to three sets of real data and its comparison with other methods verify the validity of our method. From the comparisons, we conclude that our method performs well on distantly related RNA secondary structures. In our approach, complicated computation is avoided. The whole process is easy to operate. What's more, the size of RNA secondary structure is not problematic.
Of course, there is defect in our approach: when non-linear RNA secondary structures are transformed into linear characteristic sequences, some information may be lost. However, our test has indicated that our method can yield results reasonably, i.e. our method can extract some key information from RNA secondary structures.

Lempel-Ziv algorithm and LZ complexity
Let S, Q and R be sequences over a finite alphabet Λ, l(S) be the length of S, S(i) be the ith element of S and S(i,j) be the subsequence of S that starts at position i and ends at position j. Note that S(i,j) = ∅, for i > j. The contatenation of Q and R forms a new sequence S = QR, where Q is called a prefix of S, and S is called an extension of Q if there exists an integer i such that Q = S (1, i).
a |F i | is the number of nodes in the forest F i and deg(F i ) is the degree of F i ; b N is the average size.
The difference between producibility and reproducibility is that the former allows for an extra "different" symbol at the end of the extension process which is not permitted in the latter. Therefore an extension which is reproducible is always producible but the reverse may not always be true.
Any non-null sequence S can be built from a production process by iterative self-deleting-building process where at

is EH(S)= U•UC•G•A•GG•UCGG•A.
Let c(S) be the number of components in the exhaustive history of S. It is the least possible number of steps needed to generate S according to the whole Lempel-Ziv algorithm, so c(S) becomes an important complexity indicator.

Linear characteristic sequences of RNA secondary structures
Usually, A, C, G, U are used to denote the four bases(nucleotides) in RNA sequences (primary structures). An RNA sequence can thus be represented by R = Neighbor-joining tree for the data in set III Figure 6 Neighbor-joining tree for the data in set III. It is obtained by our method and drawn by Treeview program.
We can find that we need 2 steps to build R from S, 4 steps to build R from Q, 4 steps to build Q from S. So we say R is more similar to S than to Q. The reason is that S and R share the common patterns UAC and UAA.
Based on this hypothesis, Otu et al have used the Lempel-Ziv algorithm to successfully construct phylogenetic trees from DNA sequences, which verifies the efficiency of Lempel-Ziv algorithm in analyzing the similarity of linear biological sequences. Therefore we adopt the following formula to evaluate the distance between secondary structures S and Q, which is slightly different from [28]: The denominator in [28] is equivalent to [c(L(S)L(Q))+c(L(S)L(Q))]/2, which leads to the fact that the distance calculated by the formula proposed in [28] will always be twice as much as the distance calculated by our formula. As you know, a constant will not affect the similarity analysis at all. We choose to use the formula mentioned above mainly because its expression is simpler. The formula in [28] has been proven to be a distance metric by Out et al. Thus Rd(S, Q) also satisfies the conditions required by a distance metric.
It's obvious that the more similar S is to Q, the smaller c(L(S)L(Q))-c(L(S)) and c(L(Q)L(S))-c(L(Q)) are, and then the smaller Rd(S,Q) is.
Generally, given n RNA secondary structures S 1 , S 2 ,......, S n , we can obtain their linear characteristic sequences by the above-mentioned rule, which are L(S 1 ), L(S 2 ),......, L(S n ). They are linear sequences defined over alphabet {A, C, G, U, a, c, g, u,} and carry the information on RNA secondary structures. Then, by using Lempel-Ziv algorithm, the distance between any pair of structures, Rd(S i , S j ), may be rapidly computed. By arranging them into a matrix, a pairwise distance matrix is obtained, denoted by RD. RD =(Rd(S i , S j )) contains the information on the similarity/ dissimilarity between any pair of RNA secondary structures.
Based on what Otu et al have proven, we can easily conclude that our distance metric is also valid for inferring phylogeny of species because it satisfies all the conditions for phylogeny analysis. Hence our pair-wise distance matrix may be input the programmes for phylogeny inference to study the phylogeny for RNA molecules based on their secondary structures. , 0