  • Research Article
  • Open access

LCS-TA to identify similar fragments in RNA 3D structures

Abstract

Background

In modern structural bioinformatics, the comparison of molecular structures, aimed at identifying and assessing similarities and differences between them, is one of the most commonly performed procedures. It provides the basis for evaluating in silico predicted models, constitutes the preliminary step in searching for structural motifs, and supports tracing molecular evolution. Faced with an ever-increasing amount of available structural data, researchers need a range of methods enabling comparative analysis of structures from either a global or a local perspective.

Results

Herein, we present a new, superposition-independent method which processes pairs of RNA 3D structures to identify their local similarities. The similarity is considered in the context of structure bending and bonds’ rotation which are described by torsion angles. In the analyzed RNA structures, the method finds the longest continuous segments that show similar torsion within a user-defined threshold. The length of the segment is provided as local similarity measure. The method has been implemented as LCS-TA algorithm (Longest Continuous Segments in Torsion Angle space) and is incorporated into our MCQ4Structures application, freely available for download from http://www.cs.put.poznan.pl/tzok/mcq/.

Conclusions

The presented approach ties torsion-angle-based method of structure analysis with the idea of local similarity identification by handling continuous 3D structure segments. The first method, implemented in MCQ4Structures, has been successfully utilized in RNA-Puzzles initiative. The second one, originally applied in Euclidean space, is a component of LGA (Local-Global Alignment) algorithm commonly used in assessing protein models submitted to CASP. This unique combination of concepts implemented in LCS-TA provides a new perspective on structure quality assessment in local and quantitative aspect. A series of computational experiments show the first results of applying our method to comparison of RNA 3D models. LCS-TA can be used for identifying strengths and weaknesses in the prediction of RNA tertiary structures.

Background

A comparison of contents stored in NCBI Reference Sequence Database (RefSeq) [1] and Protein Data Bank (PDB) [2] brings to a conclusion that there is a large, ever-widening gap between the numbers of known sequences and structures of biomolecules. Today, this gap is being filled with the use of computational methods that address the problem of RNA and protein 3D structure prediction. Following that, a necessity to estimate the quality of computational models and fidelity of predictors arises. Since the 1990s, CASP (Critical Assessment of protein Structure Prediction) experiment has taken the challenge of assessing protein structure prediction [3]. RNA-Puzzles initiative launched in 2011 and drawing on the solutions implemented in CASP, followed to support the RNA community [4, 5]. Both experiments have significantly contributed to a development of measures and methods for validation and assessment of 3D structure models predicted in silico [6]. The resulting algorithms have been applied not only in the evaluation of predicted proteins and RNAs. They are also used for validation and analysis of experimentally solved structures, clustering 3D models, identification of structure motifs, tracking conformational changes, exploring the sequence-structure relationship, etc. [6,7,8,9,10,11,12,13,14].

RNA-Puzzles, a collective experiment for blind RNA structure prediction, uses the following approaches to assess submitted RNA 3D models: (i) Root Mean Square Deviation (RMSD), (ii) Interaction Network Fidelity (INF) [15], (iii) Deformation Index (DI), (iv) Clash score by MolProbity [16], and (v) Mean of Circular Quantities (MCQ) [17]. Apart from these, a few other RNA evaluation methods have been developed and applied in various projects [8, 18]. All of them relate to various attributes of the considered RNA 3D structures, but their common feature is that the structures are mainly evaluated globally. Similarly, most structure assessment methods in CASP treat protein models globally, and only a few address local similarity. Such an approach is understandable and seems sufficient when evaluating and ranking many models submitted to a competition. However, when analyzing individual structures, finding their strengths and weaknesses, comparing substructures, or identifying motifs, a local assessment is necessary. In such cases, local evaluation of the 3D model complements global analysis and significantly enhances our knowledge of the structure.

So far, one approach has been proposed to enable a local view on predicted RNA 3D model compared to the target structure. It is based on a concept of spheres built along RNA backbone and providing the scene for preview and RMSD-based evaluation of sphere-enclosed atom subsets. It has been first implemented as a standalone application named RNAlyzer [8], and later released as RNAssess webserver [19]. In the case of proteins, Local-Global Alignment (LGA) is one of the most common approaches enabling local analysis [20]. LGA comprises two methods, Longest Continuous Segments (LCS) and Global Distance Test (GDT). The first one identifies the longest continual fragment within predicted protein structure which – compared to the target – has the RMSD below a given threshold. The second method computes the percentage of residues fitting below predefined distance cut-off. LGA is the reference method used to evaluate protein structures in CASP.

The methods mentioned in the previous paragraph operate in Euclidean space, where each structure is represented as a set of atoms with coordinates in the Cartesian system. Like all other approaches that consider molecular structures in Euclidean space and apply RMSD-based evaluation, they face the computationally demanding problem of optimum 3D structure alignment. This problem can be avoided by switching to the space of torsion angles. The 3D structure of RNA can be represented by a set of eight torsion angles that describe the course of its backbone and the arrangement of the bases. Such a representation makes the comparison of structures independent of their alignment in space and simplifies the computation. This concept has been followed in the MCQ4Structures method [17], which expresses structure similarity as the Mean of Circular Quantities (MCQ).

Here, we propose a new method that integrates a concept of RNA 3D structure comparison in the space of torsion angles [17] with the idea of identifying longest continuous segments displaying local similarity [20]. Two segments are considered similar if their MCQ value is below the predefined threshold. The method has been implemented as LCS-TA algorithm (Longest Continuous Segments in Torsion Angle space) and incorporated into MCQ4Structures software. It is freely available at http://www.cs.put.poznan.pl/tzok/mcq/.

Methods

LCS-TA has been designed as the local similarity measure. It aims to compare two RNA 3D structures, S (structure of the target) and S′ (structure of the model), and identify similar fragments within them. It runs either in sequence-independent or sequence-dependent mode. In the first mode, the compared structures can have different lengths, and the relationship between their residues can be unknown. Thus, no preliminary analysis of the sequences of S and S′ is required here. In the second mode, the method processes structures of the same length. LCS-TA operates in the space of torsion angles, so it is superposition-independent and does not involve finding the optimum alignment of structures. The method scans both structures stepwise along their backbones and uses a moving search window to select segments for a comparison. In this routine, a divide and conquer formula is followed to determine the window size in each step. For a pair of window-highlighted segments, LCS-TA computes MCQ value over a set of torsion angles related to the segments. Next, it checks whether the MCQ value is below the threshold. At the output, LCS-TA provides the length of the longest continuous segment satisfying similarity condition (i.e., fitting below the threshold) and segment location (its first and last residue numbers). The resulting segment’s length (referred to as LCS) is the measure of local similarity. Both components of the method, that is divide and conquer procedure and MCQ-based measure, are described in the following paragraphs.

Divide and conquer procedure

Divide and conquer (D&C) is a technique used to optimize the process of solving the problem by recursively splitting it into smaller subproblems and using their solutions to build the solution of the input problem. In our method, we apply D&C approach to determine lengths of the search window in consecutive steps of the algorithm. The example recursion tree visualizing divide-and-conquer-driven computation in LCS-TA algorithm is presented in Fig. 1.

Fig. 1 Example recursion tree in LCS-TA algorithm

The initial window size in LCS-TA is equal to the number n of residues in the predicted model (WinSize = n). In each iteration, the algorithm checks whether a feasible solution (namely, a continuous segment with MCQ below the threshold) exists for the current window size. If not, WinSize is divided by 2 and rounded up to the nearest integer. Otherwise, it is increased to a value halfway between the current size and the WinSize of the grandparent iteration (i.e., iteration i-2, where i is the index of the current iteration), except in the first iteration, where n-1 is taken as the upper bound of WinSize. Next, the computation runs recursively for both sizes of the search window, thus branching into two subproblems. The algorithm stops if further reduction of the window size is impossible (WinSize = 1) and all possible solutions for that WinSize value have been checked, or if the optimum solution is found. Such a computation pattern, known as binary tree recursion, is one of the most commonly used in implementations of the D&C method. Its time complexity is O(log₂ n), where n is the instance size (in our problem, n is the number of residues in S′, the structure of the predicted model).
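The halving and midpoint steps described above resemble a bisection over window sizes. The sketch below is a simplified illustration and not the authors' exact schedule: the function name `longest_feasible_window` and the callback `is_feasible` are hypothetical, and the code assumes that feasibility is monotone in window size (if a window of some size is feasible, so is every smaller one), which an MCQ threshold does not guarantee in general.

```python
def longest_feasible_window(n, is_feasible):
    """Bisection over window sizes 1..n (illustrative simplification).

    is_feasible(size) is a hypothetical callback that reports whether any
    segment of that size passes the MCQ threshold. Monotone feasibility
    is assumed here; it is not guaranteed by the MCQ measure itself.
    Returns the largest feasible size found (0 if none) and the sizes probed.
    """
    probes = [n]
    if is_feasible(n):            # whole model already passes
        return n, probes
    lo, hi = 0, n                 # lo: largest feasible so far, hi: smallest infeasible
    while hi - lo > 1:
        mid = (lo + hi + 1) // 2  # midpoint, rounded up
        probes.append(mid)
        if is_feasible(mid):
            lo = mid              # try longer windows next
        else:
            hi = mid              # try shorter windows next
    return lo, probes


if __name__ == "__main__":
    # Toy predicate: pretend that segments of up to 13 residues are feasible.
    size, probes = longest_feasible_window(71, lambda w: w <= 13)
    print(size, probes)
```

The number of probed sizes grows logarithmically with n, which is the source of the log₂ n factor quoted in the text.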

MCQ-based measure

The MCQ-based distance measure has been developed for trigonometric representation of the molecule 3D structure [17]. In this representation, a shape of every RNA residue is described by eight torsion angles from the set T = {α, β, γ, δ, ε, ζ, P, χ}. Each torsion angle in RNA molecule is defined by atom quadruple (the details can be found in [17, 21]) and determines rotation around particular chemical bond. It is computed as a dihedral angle between two planes defined by a pair of overlapping atom triples. Having a chain A-B-C-D of four atoms, we can easily determine the torsion angle between the plane passing through A, B, C, and the plane passing through B, C, D.
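As an illustration of the dihedral-angle definition above, the following minimal sketch computes a signed torsion angle from four atom positions A, B, C, D using the standard cross-product formula. The function name `torsion` and the coordinate tuples are illustrative assumptions; this is not code from MCQ4Structures.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def torsion(a, b, c, d):
    """Signed dihedral angle (radians) between planes ABC and BCD."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1 = _cross(b1, b2)           # normal of the plane through A, B, C
    n2 = _cross(b2, b3)           # normal of the plane through B, C, D
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)


if __name__ == "__main__":
    # Made-up coordinates: D lies in the same plane and on the same side as A.
    print(torsion((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))  # cis arrangement
    print(torsion((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1)))  # perpendicular planes
```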

When the RNA structure is composed of n residues, its trigonometric representation is a matrix containing 8n values of torsion angles \( t_{ij} \), where i = 1,...,n, j = 1,...,|T|, and T is the set of torsion angles defined for RNA (\( t_{ij} \) is the torsion angle of type j within residue i). To measure the distance between two structures, S and S′, of equal length (n residues), given in trigonometric representations, we apply formula (1) for computing the mean of circular quantities [17]:

$$ \mathrm{MCQ}\left(S,{S}^{\prime}\right)=\arctan \left({\sum}_{i=1}^n{\sum}_{j=1}^{\left|T\right|}\sin \varDelta \left({t}_{ij},{t}_{ij}^{\prime}\right),{\sum}_{i=1}^n{\sum}_{j=1}^{\left|T\right|}\cos \varDelta \left({t}_{ij},{t}_{ij}^{\prime}\right)\right) $$
(1)

The two-argument arctan(y, x) is used to distinguish results from the whole range [−π; π). This is possible because the function calculates the angle from the positive X half-axis to the vector between points (0, 0) and (x, y) in a Cartesian coordinate system. In particular, this means that, unlike the one-argument \( \arctan \left(y/x\right) \), the two-argument variant is well-defined for x = 0, and in general arctan(y, x) ≠ arctan(−y, −x), which is not true for the one-argument function.
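The difference between the two variants can be checked with a few lines of Python, whose standard library provides both as `math.atan` and `math.atan2`. The helper names `one_arg` and `two_arg` are introduced here only for illustration.

```python
import math


def one_arg(y, x):
    """One-argument form: only the ratio y/x is seen, so the quadrant is lost."""
    return math.atan(y / x)


def two_arg(y, x):
    """Two-argument form: the signs of y and x select the quadrant."""
    return math.atan2(y, x)


if __name__ == "__main__":
    # Opposite vectors (1, 1) and (-1, -1) have the same ratio y/x = 1.
    print(one_arg(1.0, 1.0), one_arg(-1.0, -1.0))  # the two values coincide
    print(two_arg(1.0, 1.0), two_arg(-1.0, -1.0))  # the two values differ by pi
    # x = 0 is handled by atan2, whereas y / x would raise ZeroDivisionError.
    print(two_arg(1.0, 0.0))
```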

In formula (1), the following function is used to obtain the distance between two angles:

$$ \varDelta \left(t,{t}^{\prime}\right)=\left\{\begin{array}{ll}0 & \mathrm{if}\ \mathrm{both}\ t\ \mathrm{and}\ {t}^{\prime }\ \mathrm{are}\ \mathrm{undefined}\\ {}\pi & \mathrm{if}\ \mathrm{exactly}\ \mathrm{one}\ \mathrm{of}\ t\ \mathrm{and}\ {t}^{\prime }\ \mathrm{is}\ \mathrm{undefined}\\ {}\min \left\{\mathrm{diff}\left(t,{t}^{\prime}\right),2\pi -\mathrm{diff}\left(t,{t}^{\prime}\right)\right\} & \mathrm{otherwise}\end{array}\right. $$
(2)

where

$$ \mathrm{diff}\left(t,{t}^{\prime}\right)=\left|\operatorname{mod}(t)-\operatorname{mod}\left({t}^{\prime}\right)\right| $$
(3)

and

$$ \operatorname{mod}(t)=\left(t+2\pi \right)\ \mathrm{modulo}\ 2\pi $$
(4)

MCQ has been defined as a distance measure, and it shows the dissimilarity of two three-dimensional structures of the same length. Thus, the greater is its value, the more the two structures differ. And accordingly, the smaller the MCQ value, the greater is the similarity of compared structures.
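Formulas (1) to (4) translate directly into code. The sketch below is a minimal illustration under stated assumptions, not the MCQ4Structures implementation: a structure is assumed to be a list of residues, each residue a tuple of torsion angles in radians, with `None` standing for an undefined angle. All function names are chosen here for illustration.

```python
import math

TWO_PI = 2.0 * math.pi


def mod(t):
    """Formula (4): map an angle into [0, 2*pi)."""
    return (t + TWO_PI) % TWO_PI


def diff(t, u):
    """Formula (3): absolute difference of the normalized angles."""
    return abs(mod(t) - mod(u))


def delta(t, u):
    """Formula (2): distance between two angles; None marks an undefined angle."""
    if t is None and u is None:
        return 0.0
    if t is None or u is None:
        return math.pi
    d = diff(t, u)
    return min(d, TWO_PI - d)


def mcq(s, s_prime):
    """Formula (1): mean of circular quantities, in radians."""
    if len(s) != len(s_prime):
        raise ValueError("structures must have the same number of residues")
    sin_sum = cos_sum = 0.0
    for residue, residue_prime in zip(s, s_prime):
        for t, u in zip(residue, residue_prime):
            d = delta(t, u)
            sin_sum += math.sin(d)
            cos_sum += math.cos(d)
    return math.atan2(sin_sum, cos_sum)


if __name__ == "__main__":
    # Two made-up one-residue structures with two angles each.
    s = [(0.1, 3.0)]
    s_prime = [(0.3, -3.0)]
    print(math.degrees(mcq(s, s_prime)))  # reported in degrees
```

Because each residue is just a tuple, restricting the comparison to a subset of angle types only requires passing shorter tuples.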

It should be noted that the set T of torsion angles defined for RNA originally contained eight types of angles. However, MCQ is flexible, and any subset of T can be used to compute it. For example, if the user is interested in the ribose ring only, then MCQ can be computed using only the pseudotorsion angle P (or, alternatively, the τ0, τ1, τ2, τ3, τ4 angles). In the presented version of the algorithm, we use the original set T = {α, β, γ, δ, ε, ζ, P, χ}.

Finally, note that the MCQ value is originally computed in radians. In our application, it is then converted into degrees before being presented to the user.

LCS-TA algorithm

The LCS-TA algorithm compares two RNA 3D structures (hereafter referred to as the target and the model) provided in PDB or mmCIF file formats. At the input, the user should also specify the MCQ threshold value in degrees and select the mode (sequence-independent or sequence-dependent). At the output, the algorithm provides the longest continuous segment (its location within both structures), its length, and its actual MCQ value. If more than one solution exists, all of them are shown to the user.

LCS-TA applies divide and conquer approach (Fig. 1) to find the optimum solution, i.e., the longest continuous segment in the model whose MCQ-based similarity to the target fragment is below the specified MCQ threshold. The computation proceeds as follows. First, the algorithm computes MCQ between entire structures. If its value does not exceed the threshold, the whole model structure is returned as the optimum solution. Otherwise, the size of the current search window is determined according to the D&C procedure described in the previous sections. Next, a set of candidate segments is constructed based on the model structure: the search window moves along the model from its 5′ to 3′-end, and all window-highlighted fragments are put into the candidate set. Thus, the current candidate set contains all segments with length equal to the current window size. After that, for every segment from the candidate set the algorithm checks if it is a feasible solution. This part of the algorithm differs between the modes. In the sequence-independent mode, the check is done by positioning the candidate segment stepwise along the target structure, i.e., the candidate segment moves along the target structure every single residue. In the sequence-dependent mode, the candidate segment is compared to the corresponding fragment of the target structure. Two sets of torsion angles, one describing the candidate and the other describing the target segment, are computed. Based on that, the MCQ value between the positioned segments is determined. If the MCQ is below the user-defined threshold, the candidate segment is a feasible solution. If the feasible solution exists in the candidate set, the algorithm tries to find the longer segment (window size is enlarged for the next iteration). Otherwise, shorter segments are considered (window size is reduced for the next iteration). The procedure iterates until the stopping condition is satisfied.

Below, we show the pseudocode of LCS-TA focusing on the general steps of the algorithm running in the sequence-independent mode. In the sequence-dependent mode, the comparison of corresponding segments is done within one FOR EACH loop, instead of two nested loops.

Figure a Pseudocode of the LCS-TA algorithm (sequence-independent mode)

The LCS-TA algorithm in sequence-independent mode runs with the worst-case computational complexity of O(n² log₂ n). In the sequence-dependent mode, the complexity is O(n log₂ n), where n denotes the number of residues in the predicted model. This computational complexity is due to the complexity of D&C being O(log₂ n), and the number of comparisons performed for every candidate segment in a single iteration.
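To make the two modes concrete, the following sketch scans candidate segments of the model against the target. It is an illustration under stated assumptions rather than the published pseudocode: for clarity it replaces the D&C window schedule with an exhaustive scan from the longest window to the shortest, it treats "below the threshold" as a strict inequality, and it assumes structures are lists of per-residue angle tuples in radians (`None` for an undefined angle). The names `lcs_ta` and `mcq_deg` are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def _delta(t, u):
    """Distance between two angles in radians; None marks an undefined angle."""
    if t is None and u is None:
        return 0.0
    if t is None or u is None:
        return math.pi
    d = abs(t % TWO_PI - u % TWO_PI)
    return min(d, TWO_PI - d)


def mcq_deg(seg_a, seg_b):
    """MCQ between two equally long segments, in degrees."""
    deltas = [_delta(t, u)
              for res_a, res_b in zip(seg_a, seg_b)
              for t, u in zip(res_a, res_b)]
    y = sum(math.sin(d) for d in deltas)
    x = sum(math.cos(d) for d in deltas)
    return math.degrees(math.atan2(y, x))


def lcs_ta(target, model, threshold_deg, sequence_dependent=True):
    """Return (length, hits); each hit is (model_start, target_start, mcq_in_degrees).

    Start positions are 0-based. Windows are tried from longest to shortest,
    which is a simplification of the D&C schedule described in the text.
    """
    max_win = min(len(model), len(target))
    for win in range(max_win, 0, -1):
        hits = []
        for i in range(len(model) - win + 1):          # candidate segments of the model
            if sequence_dependent:
                starts = [i] if i + win <= len(target) else []
            else:
                starts = range(len(target) - win + 1)  # slide along the target
            for j in starts:
                value = mcq_deg(model[i:i + win], target[j:j + win])
                if value < threshold_deg:
                    hits.append((i, j, value))
        if hits:
            return win, hits                           # all longest solutions
    return 0, []


if __name__ == "__main__":
    # Made-up data: six residues with two angles each; residues 0, 1 and 5 are distorted.
    target = [(0.4 * i, 0.0) for i in range(6)]
    model = list(target)
    for k in (0, 1, 5):
        model[k] = (target[k][0] + 3.0, 3.0)
    print(lcs_ta(target, model, 1.0, sequence_dependent=True))
    print(lcs_ta(target, target[1:4], 1.0, sequence_dependent=False))
```

The single inner comparison in sequence-dependent mode versus the extra loop over target positions in sequence-independent mode is what accounts for the additional factor of n in the stated complexity.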

Accessibility and usage

The LCS-TA algorithm has been implemented as a new functionality of MCQ4Structures [17], running as a standalone Java Web Start application. It is freely available for download at http://www.cs.put.poznan.pl/tzok/mcq/.

Results and discussion

In this section, we present the results of LCS-TA experimental runs over selected RNA 3D structures. We analyze the algorithm’s output in the case of structure processing in sequence-independent and sequence-dependent mode, and we observe the impact of MCQ threshold value on local and global similarity assessment.

For a pair of compared RNA structures, the LCS-TA algorithm provides the following output data: (i) LCS, the length of the optimum solution (the longest continuous segment) measured as the number of residues in the segment, (ii) target structure coverage by the resulting segment, that is, the ratio of segment length to structure length (as a percentage), (iii) the actual MCQ value of the segment, and (iv) the segment location within the structures (numbers of the first and last residue). If more than one optimum solution exists for two input structures, all of them are given to the user. The data are provided in plain text format and can be downloaded as a CSV file.
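The relationship between these output fields can be sketched in a few lines. The function `summarize`, the field names, and the use of 0 for residues outside the segment in the position string are illustrative assumptions, not the actual output format of MCQ4Structures; residue positions are assumed to be numbered from 1.

```python
def summarize(target_len, first, last, mcq_deg):
    """Build an output record for one segment (1-based, inclusive positions).

    Field names and the 0/1 position string are assumptions for illustration.
    """
    lcs = last - first + 1                       # segment length in residues
    coverage = 100.0 * lcs / target_len          # percentage of the target covered
    mask = "0" * (first - 1) + "1" * lcs + "0" * (target_len - last)
    return {
        "LCS": lcs,
        "coverage_pct": round(coverage, 2),
        "MCQ_deg": mcq_deg,
        "first": first,
        "last": last,
        "mask": mask,
    }


if __name__ == "__main__":
    # A segment spanning a whole 71-nt structure covers 100% of it.
    record = summarize(71, 1, 71, 23.48)
    print(",".join(str(record[key]) for key in ("LCS", "coverage_pct", "MCQ_deg", "first", "last")))
```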

In the first experiment, we ran the LCS-TA algorithm for two RNA 3D models submitted to RNA-Puzzles challenge 18, which were compared to the target structure of exonuclease resistant RNA from Zika virus (PDB id: 5TPY) [22]. Model 1 predicted by RNAComposer [23, 24] in the server category and model 1 submitted by the Chen group [25] in the human category were selected for examination. In the paper, they are referred to as RNAComposer_1 and Chen_1, respectively. Both models were processed by LCS-TA running in two modes, sequence-independent and sequence-dependent. In each mode, we planned to apply the following values of MCQ threshold: 5, 10, 15, 20, 25, 30, 35 and 40 degrees. The runs with the MCQ threshold set to 5° returned no solution for either model. On the other hand, for the MCQ threshold equal to 25°, the algorithm output the entire 71 nt-long structure with an actual MCQ value of 23.48° in the case of RNAComposer_1, and 23.81° for the Chen_1 model. This means that the MCQ of the whole model was below the 25° threshold in both cases. With 25° constituting the breakout point of the experiment, no further increase of the threshold was necessary.

Tables 1 and 2 present the results of processing the RNAComposer_1 and Chen_1 models by LCS-TA with respect to the target structure in sequence-independent and sequence-dependent mode, respectively. For every MCQ threshold between 10° and 25°, we can see the position of the longest continuous segment within the model (and the target) marked with a value of 1 in the character string, the segment size (LCS), and its actual MCQ value. In every case, the RNAComposer_1 model dominates Chen_1 as far as the LCS value is concerned. In all cases except one, a single optimum solution has been found. Only for the MCQ threshold set to 10° have three segments with LCS = 9 been identified within the RNAComposer_1 model in sequence-independent mode. A closer look at the results shows that the most significant diversity in segment length and location within both models is observed for the MCQ threshold equal to 20°. Solutions obtained for this threshold value have been visualized using PyMOL in Figs. 2 and 3. In every figure, the longest continuous segment identified in the model (colored) has been superimposed onto the target structure (grey) at the location of the corresponding target segment. As shown in the figures, different segments have been identified in the considered models.

Table 1 Longest segments found in the sequence-independent mode for RNAComposer_1 and Chen_1 models of 5TPY structure
Table 2 Longest segments found in the sequence-dependent mode for RNAComposer_1 and Chen_1 models of 5TPY structure
Fig. 2 Longest segments (colored) found in sequence-independent mode, MCQ threshold = 20°, within (a) RNAComposer_1 and (b) Chen_1 models, aligned onto the target 5TPY structure (gray)

Fig. 3 Longest segments (colored) found in sequence-dependent mode, MCQ threshold = 20°, within (a) RNAComposer_1 and (b) Chen_1 models, aligned onto the target 5TPY structure (gray)

To complete the similarity analysis in the first experiment, we used another similarity measure to evaluate the LCS-TA results. It can be assumed that two fragments with similar torsion angles are also similar in the space of atom coordinates. To verify this assumption, we processed the RNAComposer_1 and Chen_1 models using RNAssess [19]. This tool supports the identification of local similarity between two RNA 3D structures in the sequence-dependent mode. RNAssess compares model and target structures using the idea of moving spheres and computing RMSD between RNA fragments included in the corresponding spheres (one sphere positioned in the model, the other in the target). The results of the comparison are provided in graphical form (line graphs, 2D and 3D maps). To present the results of RNAComposer_1 and Chen_1 processing with reference to the target structure, we have selected 2D maps (see Fig. 4). The value of RMSD computed for a sphere positioned at a particular place along the RNA chain is represented by colour. Dark blue areas represent fragments of high similarity. It can be observed that the location of fragments identified by LCS-TA (Table 2) coincides with the dark blue areas of the RNAssess maps (Fig. 4). Thus, for our example structures, the similarity in torsion angle space is accompanied by the similarity in Euclidean space of atom coordinates. This is true for MCQ thresholds not exceeding 20 degrees (above this threshold LCS-TA returns the whole structure as a result). Our analysis finished with computing RMSD for the identified fragments of the RNAComposer_1 and Chen_1 models. In the case of fragments found within the RNAComposer_1 model in sequence-dependent mode, their RMSD values were equal to 0.702 Å for MCQ threshold = 10° and 0.959 Å for MCQ threshold = 15°, while the global RMSD of RNAComposer_1 equals 24.48 Å. For Chen_1, the RMSD of the LCS-TA-provided fragment was 2.011 Å for MCQ threshold = 15° (no feasible solution was found in this model for a smaller threshold), while the global RMSD of the model was only 3.144 Å.

Fig. 4 Results of (a) RNAComposer_1 and (b) Chen_1 models comparison to the target structure (5TPY) by RNAssess

In the second experiment, we have investigated multiple models predicted in RNA-Puzzles challenge 18 and challenge 19. Altogether, 53 models were submitted in challenge 18, and 54 in challenge 19. From these sets, we have selected one model per participant (namely, model 1) and compared it to the target structure, i.e., exonuclease resistant RNA from Zika virus (PDB id: 5TPY) [22] in challenge 18, and twister sister (TS) ribozyme (PDB id: 5T5A) [26] in challenge 19. Experimental results concerning the selected models are presented in Tables 3 and 4 and Fig. 5 for challenge 18, and Tables 5 and 6 and Fig. 6 for challenge 19. In the tables, one can see the LCS value, i.e., the length of the resulting segment found within each model for different MCQ thresholds, and the actual MCQ of this segment. The best solution (LCS of the longest continuous segment found among all models) in the human and server categories is printed in bold. If several models include a segment with the biggest LCS, the one with the smallest actual MCQ is considered the winner. The figures complement the tabular data by showing, for each model and MCQ threshold, the percentage of the target structure covered by the optimum solution.

Table 3 LCS-TA results for predicted models of 5TPY structure in the sequence-independent mode
Table 4 LCS-TA results for predicted models of 5TPY structure in the sequence-dependent mode
Fig. 5 LCS-TA results for predicted models of 5TPY in (a) sequence-independent and (b) sequence-dependent mode

Table 5 LCS-TA results for predicted models of 5T5A structure in the sequence-independent mode
Table 6 LCS-TA results for predicted models of 5T5A structure in the sequence-dependent mode
Fig. 6 LCS-TA results for predicted models of 5T5A in (a) sequence-independent and (b) sequence-dependent mode

Eleven participants submitted their predictions for challenge 18. Thus, 11 RNA 3D models were selected for the analysis with LCS-TA (Tables 3 and 4, Fig. 5). This number includes six human predictions (Fig. 5, solid lines) and five server-predicted ones (Fig. 5, dotted lines). In the human category, the Das_1 model has appeared to win for all MCQ thresholds. Among server predictions, the RW3D_1 model, generated by the Das server (unpublished), has been the best. This is true for both modes of LCS-TA. In the case of sequence-independent analysis and the MCQ threshold set to 10°, RW3D_1 dominates Das_1 (Table 3). However, this relationship is not the same in the sequence-dependent mode (Table 4). A comparison of the results for Das_1 and RW3D_1 with MCQ threshold = 10° in both modes shows that there is one accurately predicted 12 nt-long segment in Das_1, which is identified by LCS-TA in both modes. However, for RW3D_1 the longest segment below the 10° threshold (with LCS = 18) corresponds very well to the other part of the target structure. This influences the overall quality of the RW3D_1 prediction and makes it globally a little worse than that of Das_1. Nevertheless, the accuracy and quality of both models are very high. The MCQ computed for each of these models as a whole does not exceed 20 degrees. Thus, starting from the threshold set to 20°, the optimum solution in both cases covers 100% of the structure (Fig. 5).

Challenge 19 has also attracted 11 participants, including six in the human category (Fig. 6, solid lines) and five in the group of servers (Fig. 6, dotted lines). Thus, 11 predicted models were processed with LCS-TA (Tables 5 and 6 and Fig. 6). The results of this experiment show a greater diversity in the relationship between the models than in the case of challenge 18. In the human category, the situation is similar for both LCS-TA modes. Das_1 proves the best for MCQ threshold = 5°. However, when the threshold value increases to 10, 15, 20, 25 and 30 degrees, RNAComposerH_1 dominates all other models as far as LCS and actual MCQ are concerned. In the server category, the longest segments have been found in the RNAComposer_1 [23, 24], RW3D_1 and simRNA_1 [27] models, depending on the MCQ threshold and LCS-TA mode. This shows that although globally the considered models seem quite similar, the differences on a local level can be significant. Thus, local analysis of the model can indicate the direction for further development and improvement of the prediction approach. From these results, we can also see that the global ranking of models based on the LCS-TA value highly depends on the MCQ threshold.

Molecules selected for the above analysis are medium-size RNA structures. Their processing by both alignment-based and alignment-free algorithms is possible, although it is more time-consuming in the case of the first group of methods. The difference between the computing times of the two groups increases significantly with the increase in molecule size. The length of the RNA chain can also influence the quality of results generated by alignment-based algorithms, which provide a suboptimum solution. However, this is not the case for the alignment-free approach, including LCS-TA. To show that our algorithm also works for longer RNAs, we have applied it to process RNA 3D models submitted to RNA-Puzzles challenge 7 and challenge 8. In the first case, we have chosen one model per participant (namely, model 1) and compared it to the target structure of Varkud satellite ribozyme (PDB id: 4R4V) [28]. Similarly, the first model submitted by each participant in challenge 8 was selected and analyzed with reference to the target structure of SAM I/IV-riboswitch (PDB id: 4L81) [29]. Altogether, we have processed seven models from challenge 7 and six models from challenge 8. In all cases, the LCS-TA algorithm provided results, finding similar fragments positioned along the entire structure. The results of these experiments are presented in Additional file 1.

Conclusions

In the paper, we have addressed the problem of identifying similar fragments within RNA 3D structures and tertiary structure similarity assessment on the local level. We have introduced LCS-TA method that finds fragments displaying high similarity in torsion angle space. The method has been implemented in Java and added to MCQ4Structures standalone application, freely available at http://www.cs.put.poznan.pl/tzok/mcq/. We have shown an example application of the method in processing and analysis of RNA 3D structures predicted within RNA-Puzzles challenge 18 and 19.

Our algorithm is computationally non-demanding and user-friendly. At the input, it requires PDB or mmCIF files with RNA 3D structures and MCQ threshold value. The results are easy to compare and interpret. Thus, we hope it will be of wide interest in the RNA community.

LCS-TA has the potential to open new avenues in the RNA structural bioinformatics, particularly in the field of evaluating predicted RNA 3D models, local similarity assessment, as well as in structure motif/module identification and examination. Our future works will follow in this direction. We are going to perform large-scale tests of the method to define reliable MCQ thresholds. We plan to analyze the relationship between LCS-TA results and the secondary structure motifs of the analyzed RNA structures. This kind of analysis can indicate RNA motifs or fragments which are particularly hard (or easy) to predict. Finally, we plan to supplement the algorithm with the graphical output.

Abbreviations

CASP: Critical Assessment of protein Structure Prediction
CSV: Comma-Separated Values
D&C: Divide and conquer
GDT: Global Distance Test
INF: Interaction Network Fidelity
LCS: Longest Continuous Segments
LCS-TA: Longest Continuous Segments in Torsion Angle space
LGA: Local-Global Alignment
MCQ: Mean of Circular Quantities
RMSD: Root Mean Square Deviation

References

  1. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:D130–5.

  2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–42.

  3. Moult J, Pedersen JT, Judson R, Fidelis KA. Large-scale experiment to assess protein structure prediction methods. Proteins. 1995;23:ii–v.

  4. Cruz JA, Blanchet MF, Boniecki M, Bujnicki JM, Chen SJ, Cao S, et al. RNA-puzzles: a CASP-like evaluation of RNA three-dimensional structure prediction. RNA. 2012;18:610–25.

  5. Miao Z, Adamiak RW, Antczak M, Batey RT, Becka A, Biesiada M, et al. RNA-puzzles round III: 3D RNA structure prediction of five riboswitches and one ribozyme. RNA. 2017;23:655–72.

  6. Miao Z, Westhof E. RNA structure: advances and assessment of 3D structure prediction. Annu Rev Biophys. 2017;46:483–503.

  7. Blazewicz J, Szachniuk M, Wojtowicz A. RNA tertiary structure determination: NOE pathway construction by tabu search. Bioinformatics. 2005;21:2356–61.

  8. Lukasiak P, Antczak M, Ratajczak T, Bujnicki JM, Szachniuk M, Popenda M, Adamiak RW, Blazewicz J. RNAlyzer - novel approach for quality analysis of RNA structural models. Nucleic Acids Res. 2013;41:5978–90.

  9. Szostak N, Royo F, Rybarczyk A, Szachniuk M, Blazewicz J, del Sol A, Falcon-Perez JM. Sorting signal targeting mRNA into hepatic extracellular vesicles. RNA Biol. 2014;11:836–44.

  10. Zok T, Antczak M, Riedel M, Nebel D, Villmann T, Lukasiak P, Blazewicz J, Szachniuk M. Building the library of RNA 3D nucleotide conformations using clustering approach. Int J Appl Math Comp. 2015;25:689–700.

  11. Rybarczyk A, Szostak N, Antczak M, Zok T, Popenda M, Adamiak RW, Blazewicz J, Szachniuk M. New in silico approach to assessing RNA secondary structures with non-canonical base pairs. BMC Bioinformatics. 2015;16:276.

  12. Gudanis D, Popenda L, Szpotkowski K, Kierzek R, Gdaniec Z. Structural characterization of a dimer of RNA duplexes composed of 8-bromoguanosine modified CGG trinucleotide repeats: a novel architecture of RNA quadruplexes. Nucleic Acids Res. 2016;44:2409–16.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Wiedemann J, Milostan M. StructAnalyzer - a tool for sequence versus structure similarity analysis. Acta Biochim Pol. 2016;63:753–7.

    Article  CAS  PubMed  Google Scholar 

  14. Miskiewicz J, Tomczyk K, Mickiewicz A, Sarzynska J, Szachniuk M. Bioinformatics study of structural patterns in plant microRNA precursors. Biomed Res Int. 2017; doi: 10.1155/2017/6783010.

  15. Parisien M, Cruz JA, Westhof E, Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA. 2009;15:1875–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Chen VB, Arendall WB 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr. 2010;66:12–21.

    Article  CAS  PubMed  Google Scholar 

  17. Zok T, Popenda M, Szachniuk M. MCQ4Structures to compute similarity of molecule structures. Cent Eur J Oper Res. 2014;22:457–74.

    Article  Google Scholar 

  18. Wang J, Zhao Y, Zhu C, Xiao Y. 3dRNAscore: a distance and torsion angle dependent evaluation function of 3D RNA structures. Nucleic Acids Res. 2015;43:e63.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Lukasiak P, Antczak M, Ratajczak T, Szachniuk M, Popenda M, Adamiak RW, Blazewicz J. RNAssess - a webserver for quality assessment of RNA 3D structures. Nucleic Acids Res. 2015;43:W502–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Richardson JS, Schneider B, Murray LW, Kapral GJ, Immormino RM, Headd JJ, et al. RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA ontology consortium contribution). RNA. 2008;14:465–81.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Akiyama BM, Laurence HM, Massey AR, Costantino DA, Xie X, Yang Y, Shi PY, Nix JC, Beckham JD, Kieft JS. Zika virus produces noncoding RNAs using a multi-pseudoknot structure that confounds a cellular exonuclease. Science. 2016;354:1148–52.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Popenda M, Szachniuk M, Antczak M, Purzycka KJ, Lukasiak P, Bartol N, et al. Automated 3D structure composition for large RNAs. Nucleic Acids Res. 2012;e112:40.

    Google Scholar 

  24. Antczak M, Popenda M, Zok T, Sarzynska J, Ratajczak T, Tomczyk K, Adamiak RW, Szachniuk M. New functionality of RNAComposer: an application to shape the axis of miR160 precursor structure. Acta Biochim Pol. 2016;63:737–44.

    Article  CAS  PubMed  Google Scholar 

  25. Xu X, Zhao P, Chen SJ. Vfold: a webserver for RNA structure and folding thermodynamics prediction. PLoS One. 2014;9:e107504.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Liu Y, Wilson TJ, Lilley DMJ. The structure of a nucleolytic ribozyme that employs a catalytic metal ion. Nat Chem Biol. 2017;13:508–13.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Boniecki MJ, Lach G, Dawson WK, Tomala K, Lukasz P, Soltysinski T, Rother KM, Bujnicki JM. SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. 2016;44:e63.

    Article  PubMed  Google Scholar 

  28. Suslov NB, DasGupta S, Huang H, Fuller JR, Lilley DMJ, Rice PA, Piccirilli JA. Crystal structure of the Varkud satellite ribozyme. Nat Chem Biol. 2015;11:840–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Trausch JJ, Xu Z, Edwards AL, Reyes FE, Ross PE, Knight R, Batey RT. Structural basis for diversity in the SAM clan of riboswitches. PNAS. 2014;111:6624–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

This research was carried out at the European Centre for Bioinformatics and Genomics, Poznan University of Technology (Poznan, Poland) and supported by the Leading National Research Centre Program (KNOW) granted by the Polish Ministry of Science and Higher Education.

Funding

This work has been supported by the Polish Ministry of Science and Higher Education and the Institute of Bioorganic Chemistry, PAS within the intramural financing program. The authors acknowledge partial support by the National Science Center, Poland [2016/23/B/ST6/03931, 2016/23/N/ST6/03779].

Availability of data and materials

All predicted RNA 3D models used in our computational experiments are available at the RNA-Puzzles website: http://ahsoka.u-strasbg.fr/rnapuzzlesv2/results/. The target structures can also be accessed via this webpage.

Author information

Contributions

JW, TZ, and MS conceived the study. MM and MS prepared a specification of the project. JW and MM designed the LCS-TA algorithm. JW implemented the algorithm, supported by TZ, who authored the basic method for MCQ computation. JW carried out the computational tests, whose results were further analyzed with the aid of MM and MS. MS coordinated the project. JW, MM, and MS drafted the manuscript; JW and MM prepared the figures. All authors were involved in discussions, as well as in reading and approving the final manuscript.

Corresponding author

Correspondence to Marta Szachniuk.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1: Table S1. LCS-TA results for predicted models of 4R4V structure in the sequence-independent mode. Table S2. LCS-TA results for predicted models of 4R4V structure in the sequence-dependent mode. Table S3. LCS-TA results for predicted models of 4L81 structure in the sequence-independent mode. Table S4. LCS-TA results for predicted models of 4L81 structure in the sequence-dependent mode. Figure S1. LCS-TA results for predicted models of 4R4V in (a) sequence-independent and (b) sequence-dependent mode. Figure S2. LCS-TA results for predicted models of 4L81 in (a) sequence-independent and (b) sequence-dependent mode. Table S5. Longest segments found within example models of 4L81 structure in the sequence-dependent mode. Figure S3. Results of (a) Bujnicki_1, (b) Das_1, and (c) Dokholyan_1 model comparison to the target structure (4L81) by RNAssess. (PDF 465 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Wiedemann, J., Zok, T., Milostan, M. et al. LCS-TA to identify similar fragments in RNA 3D structures. BMC Bioinformatics 18, 456 (2017). https://doi.org/10.1186/s12859-017-1867-6

  • DOI: https://doi.org/10.1186/s12859-017-1867-6

Keywords