RNA 3D structure prediction using multiple sequence alignment information

Background The understanding of the importance of RNA has dramatically changed over recent years. As in the case of proteins, the function of an RNA molecule is encoded in its tertiary structure, which in turn is determined by the molecule's sequence. The prediction of tertiary structures of complex RNAs is still a challenging task. Results Using the observation that RNA sequences from the same RNA family fold into conserved structure, we test herein whether parallel modeling of RNA homologs can improve ab initio RNA structure prediction method. EvoClustRNA is a multi- step modeling process, in which homologous sequences for the target sequence are selected using the Rfam database. Subsequently, independent folding simulations using Rosetta FARFAR and SimRNA are carried out. The model of the target sequence is selected based on the most common structural arrangement of the common helical fragments. As a test, on two blind RNA-Puzzles challenges, EvoClustRNA predictions ranked as the first of all submissions for the L-glutamine riboswitch and as the second for the ZMP riboswitch. Conclusion Through a benchmark of known structures, we discovered several cases in which particular homologs were unusually amenable to structure recovery in folding simulations compared to the single original target sequence.

Background The understanding of the importance of RNA has dramatically changed over recent years. As in the case of proteins, the function of an RNA molecule is encoded in its tertiary structure, which in turn is determined by the molecule's sequence. The prediction of tertiary structures of complex RNAs is still a challenging task. Results Using the observation that RNA sequences from the same RNA family fold into conserved structure, we test herein whether parallel modeling of RNA homologs can improve ab initio RNA structure prediction method. EvoClustRNA is a multi-step modeling process, in which homologous sequences for the target sequence are selected using the Rfam database.
Subsequently, independent folding simulations using Rosetta FARFAR and SimRNA are carried out. The model of the target sequence is selected based on the most common structural arrangement of the common helical fragments. As a test, on two blind RNA-Puzzles challenges, EvoClustRNA predictions ranked as the first of all submissions for the L-glutamine riboswitch and as the second for the ZMP riboswitch. Conclusion Through a benchmark of known structures, we discovered several cases in which particular homologs were unusually amenable to structure recovery in folding simulations compared to the single original target sequence.

Background
Ribonucleic acid (RNA) is one of the key types of molecules found in living cells It is involved in a number of highly important biological processes, not only as the carrier of the genetic information but also serving catalytic, scaffolding and structural functions, and more [1]. The interest in the field of non-coding RNA such as circular RNAs [2], long non-coding RNAs [3] has been increasing for the past few decades with the new types of non-coding RNAs discovered every year. Similarly to proteins, a 3D structure of an RNA molecule determines its function. In order to build a 3D model of an RNA particle, one can take advantage of high-resolution experimental techniques, such as biocrystallography or cryo-EM [4]. However, experimental techniques are tedious, time-consuming, expensive, require specialized equipment, and not always can be applied. An alternative to experimental techniques are methods for computational modeling. However, the results of the RNA-Puzzles [5], [6], a collective experiment for RNA structure prediction, show that while accurate modeling of RNA is achievable, there is still room for improvement. In particular, recent tests [7] have demonstrated significant progress. Although encouraging, this progress still leaves the field without methods that can reliably predict RNA tertiary structure in a consistent way.
Just like proteins, RNAs can be grouped into families [8] that have evolved from a common ancestor. Sequences of RNAs from the same family can be aligned to each and the equivalency at the level of individual residues can be represented by multiple sequence alignment (MSA). The analysis of patterns of sequence conservation or the lack thereof can be used to detect important conserved regions, e.g., regions that bind ligands, active sites, or are involved in other important functions. An accurate RNA sequence alignment can be used to predict secondary structure, the Watson-Crick base pairing pattern for the RNA, a key precedent for subsequently modeling RNA tertiary structure. According to the CompaRNA [9] continuous benchmarking platform, methods that exploit RNA alignments, such as PETfold [10] outperform single sequence predictive methods for RNA secondary structure.
RNA alignments can be used to improve tertiary structure prediction. Weinreb and coworkers [11] adapted the maximum entropy model to RNA sequence alignments to predict long-range contacts between residues for 180 RNA gene families. They applied the information about predicted contacts to guide in silico simulations and observed significant improvement in predictions of five cases they researched. Another method was proposed by Martin Weigt's group [12]. These methods are reviewed elsewhere [13].
In this work, a distinct way to use RNA alignment for tertiary structure prediction is investigated. The proposed approach explores the use of multiple sequence alignment information and parallel modeling of RNA homologs to improve ab initio RNA structure prediction method. A new approach, named EvoClustRNA, takes advantage of incorporation of evolutionary information from distant sequence homologs and is based on a classic strategy of protein structure prediction [14]. By building on the empirical observation that RNA sequences from the same RNA family typically fold into similar 3D structures ( Fig. 1), we tested whether it is possible to guide in silico modeling by seeking a global helical arrangement, for the target sequence, that is shared across de novo models of numerous sequence homologs. To the best of our knowledge, EvoClustRNA is the first attempt to use this approach for RNA 3D structure prediction.
In this work, we report tests in the RNA-Puzzles 13 and 14 blind modeling trials, systematic benchmarking of the approach, and a description of the automated workflow that is now made available for the research community.

EvoClustRNA workflow
In this work, we propose a new methodology, EvoClustRNA, that can contribute to the improvement of RNA 3D structure prediction. The EvoClustRNA method takes as input (1) an alignment file, (2) a folder with models generated for homologous sequence, and (3) a file that maps sequence names from the alignment with filenames of models.
The input preparation for the workflow has to be performed manually by the user (Fig 2. 1-2). An input alignment can be obtained from the Rfam database or generated by the user.
Sequences in the alignment should be sorted by length, and the redundancy removal procedure should be applied to remove similar sequences. The shortest homologs should be modeled using the SimRNAweb server or/and Rosetta. At the final stage of the input preparation, the top 100 models from a simulation should be moved to the input folder for the EvoClustRNA workflow.
When the input files are ready, the next step of the process (Fig 2. 3-4) can be executed.
The EvoClustRNA package contains tools to make the process as easy as possible, starting from processing input models to obtain all-vs-all core RMSD matrix (evoClustRNA.py), automated clustering procedure (evoClust_autoclustix.py), ending with a script to calculate the accuracy of prediction (evoClust_calc_rmsd.py). The model of the target sequence with the highest number of neighbors is selected as the final prediction.

Blind predictions with EvoClustRNA in the RNA-Puzzles
EvoClustRNA was tested on the RNA-Puzzle 13 problem. The target of 71 nucleotides was an RNA 5-aminoimidazole-4-carboxamide riboside 5′-monophosphate (ZMP) riboswitch, which can up-regulate de novo purine synthesis in response to increased intracellular levels of ZMP [15]. The alignment for this riboswitch was downloaded from the Rfam database (Rfam ID: RF01750), whence ten homologs were selected for modeling with Rosetta. The secondary structures for all homologs were devised with Jalview based on the Rfam alignment. The pseudoknot was suggested in the available literature [16] and it was used for modeling. The EvoClustRNA prediction with an RMSD of 5.5 Å with respect to the reference structure (Fig. 3) was the second in the total ranking of RNA-Puzzles (http://www.rnapuzzles.org/blog/2000/01/01/PZ13-3d/). The final prediction was made based on the visual inspection of the best clusters, which were obtained by using the EvoClustRNA method.
EvoClustRNA was also used in the RNA-Puzzles for modeling problem 14. The RNA molecule of interest was the 61-nucleotide long L-glutamine riboswitch, which upon glutamine binding undergoes a major conformational change in the P3 helix [17]. It was the first RNA-Puzzle, for which the participating groups were asked to model two forms of the RNA molecule: one with a ligand ("bound") and another one without a ligand ("free").
However, the EvoClustRNA method was used only to model the "bound" form. The alignment for this RNA family (RFAM ID: RF01739) was downloaded from the Rfam database, whence two homologs were selected for modeling with Rosetta. It was suggested in the literature [18] that the structure included an E-loop motif. This motif was found in the PDB database and was used as a rigid fragment during the modeling. Three independent simulations were performed and the final prediction was obtained in a fully automated manner. The native structure of the riboswitch superimposed on the model obtained with the EvoClustRNA method is shown in Fig. 4. The EvoClustRNA prediction was ranked at the first place in the overall ranking with 5.5 Å RMSD with respect to the native structure (http://www.rnapuzzles.org/blog/2000/01/01/PZ14Bound-3d/). Details of these results were reported in an article describing RNA-Puzzles Round III [7].

Accuracy of prediction for RNA family
To compare the accuracy of predictions for sequences of homologs, the core RMSD was used. For each RNA family, one target sequence (sequence of the reference structure taken from the PDB database) and four sequences of homologs were processed. Full names of the sequences and secondary structures used for modeling can be found in the Supplementary data, in the text and the figure, sequences will be referred to with three letter identifiers. For different sequences that belong to the same Rfam family, divergent prediction accuracy was observed both for SimRNA and Rosetta (Fig. 5).
Interestingly, for 5 out of 8 RNA families for Rosetta and 4 for SimRNA, sequences of homologs yielded more accurate models than folding the target sequence. For example, in the case of the tRNA family, the best models from SimRNA were generated for a tRNA-Lys sequence (accession number: AB009835.1, referred as "tab") from Drosophila melanogaster (fruit fly). These models reached a core RMSD of 5 Å, in contrast, the best model of the target sequence achieved a core RMSD of 7 Å to the reference structure.
Similarly, for the TPP riboswitch, the best models from Rosetta were obtained by folding a sequence from Streptococcus agalactiae (AL766847.1, "tal").
Surprisingly, SimRNA and Rosetta performed differently for the same sequences. In 26 out of 40 folded sequences, Rosetta outperformed SimRNA (models with the lowest core RMSD to the reference structure). For example, for the target sequence and all sequences of homologs of the THF riboswitch, Rosetta generated more accurate models than SimRNA.
Similarly, for the RNA-Puzzle 14, Rosetta in the best 100 generated more accurate models for a sequence from the marine metagenome (AACY023015051.1, "cy2") homolog. In contrast, in the case of the adenine riboswitch, SimRNA generated more accurate models for the target sequence and a sequence from Clostridium difficile (AAFV01000199.1, "a99").
Together, these data indicated that folding sequences of homologs could potentially enrich with accurate predictions a pool of models taken for clustering. However, to conclusively test if this is the case, we analyzed various variants of the presented here methodology in the comparison to the controls. The results are presented in the next section.
Using MSA information to enhance the accuracy of predictions To test, if accurate predictions of sequences of homologs could improve the prediction of the structure of the target sequence, various variants of the method were compared to the controls, and the results are shown in Figure 6 and the details can be found in Table 1 of the Supplementary Data.
The following eight variants of EvoClustRNA and controls were compared to each other. As controls, the standard protocols for Rosetta FARFAR ("Rosetta") and SimRNA ("SimRNA") were used. To test the clustering procedure itself without the use of any homologous sequences, three different procedures were considered where the input was: the top 500 models from SimRNA and Rosetta combined ("SimRNA+Rosetta"), the top 1000 models from Rosetta ("Rosetta Top1k"), the top 1000 models from SimRNA ("SimRNA Top1k"). The full EvoClustRNA procedure was tested with the input including 1000 models generated for five homologous sequences (the top 200 models per one sequence) from SimRNA ("EvoClustRNA|SimRNA") and Rosetta ("EvoClustRNA|Rosetta") separately, and where 500 models (the top 100 per one sequence) produced with Rosetta and 500 models (100 per one sequence) and with SimRNA were combined into one input ("EvoClustRNA|Rosetta+SimRNA").
SimRNA Top1k reached the lowest median of RMSD, better by 1.77 Å to control, SimRNA, and better than Evo|SimRNA by 1.61 Å. For Rosetta, Rosetta Top1k and Evo|Rosetta scored worse than the control by 0.31 Å and 2.83 Å respectively. Evo|SimRNA achieved the lowest core RMSD with the difference to the control, SimRNA, of 2.26 Å. For variants of Rosetta, the best one was the control, Rosetta. In terms of INFs, the accuracy of prediction for Rosetta and Evo|Rosetta was the same (0.77). In the case of the SimRNA, Evo|SimRNA achieved INF of 0.67 and SimRNA 0.74. The differences between benchmarked variants were not statistically significant (the Wilcoxon, non-parametric statistical test to examine if related paired samples come from the same distribution).
The comparison of the two clustering modes, half and 1-of-6 mode, can be found in the Supplementary Data (Fig. S2). The analysis was performed also for various combinations of sequences of homologs (Fig. S1), e.g., taking the target sequence and one sequence of homolog one by one, then sequences of two homologs, then three and four in all possible combinations (Supplementary Data, Fig. S1). This analysis was performed with the evox_all_variants.py from the EvoClustRNA package. Also in these tests, the statistically significant overall improvement of the prediction of variants of EvoClustRNA over the controls was not detected.

Accurate predictions of structures for sequences of homologs
Encouraged by the results from the folding sequences of homologs, we searched for the more homologous sequence to investigate how they fold. Because of the computational cost of predictions, we limited our analysis to five RNA families: modeled with SimRNA (purine riboswitch, cyclic-di-GMP riboswitch, THF riboswitch, RNA-Puzzle 17) and Rosetta (RNA-Puzzle 13) (Fig. 7). Once again, we were able to identify sequences that yielded more accurate models than the target sequence. For the adenine riboswitch four sequences gave more accurate solutions, from Streptococcus pyogenes (AAFV01000199.1, "aax"), Bacillus cereus (AE016877.1, "ae0"), Clostridium botulinum (CP001581.1, "cp1"), Bacillus cytotoxicus (CP000764.1 "cp07") than models for the target sequence.
In the case of the RNA-Puzzle 17, most of the models are close to the 20 Å, however, some homologs gave single accurate models, HCF12C_58327 ("hcf") BS_KBB_SWE26_205m_c1114943 ("bsk"), 2236876006_041573 ("s23") (sequences and accession codes are taken from [19]). The striking case is the RUMENNODE_3955907_1 ("rum") homolog. This sequence yielded a set of very accurate models, much better than models for the target sequence.
For the THF riboswitch, none of the sequences of homologs gave better predictions than the target sequence. Interesting, for one of the homologs, Alkaliphilus metalliredigens (CP000724.1, "cp7"), a cluster of accurate solutions were generated. This cluster good enriches the final pool of models used for clustering and improve the selection of the final model.
In the case of the cyclic-di-GMP riboswitch, the results were consistent and comparable to the models for the target sequences and all sequences gave models of the same accuracy, with core RMSD ranging from 6.5 Å to 15 Å, after removing outliers for Peptoclostridium difficile (ABFD02000011.1,"gba") sequence).
For the RNA-Puzzle 13 (generated with Rosetta), none of the homologs gave more accurate predictions than models for the target sequence. Some homologs from human gut metagenome ("baa", BAAV01000055), Caulobacter sp. K31 ("nzh", NZ_AATH01000001) gave relatively good models, a core RMSD of 7 Å.
Example: Sampling of conformational space for the RNA-Puzzle 17 and the TPP riboswitch To understand whether there were structures that shared the same topology in comparison with the native structure in the pool of 500 models of homologs, the results of clustering were visualized with Clans [20]. To perform this analysis, we implemented a new tool called Clanstix (a part of the rna-tools package (https://rnatools.readthedocs.io/en/latest/tools.html#module-rna_tools.tools.clanstix.rna_clanstix).
Clans uses a version of the Fruchterman-Reingold graph layout algorithm to visualize pairwise sequence similarities in either two-dimensional or three-dimensional space. The program was designed to calculate pairwise attraction values to compare protein sequences; however, it is possible to load a matrix of precomputed attraction values and thereby display any kind of data based on pairwise interactions. Therefore, the Clanstix program from the rna-tools package was used to convert the all-vs-all RMSD distance matrix, between selected for clustering fragments from the EvoClustRNA|SimRNAweb runs, into an input file for Clans.
The results of clustering with Clans are shown in Figure 8. In this clustering visualization, 100 models of five homologs are shown (each homolog uniquely colored, models of the target sequence are colored in lime). Models with a pairwise distance in terms of RMSDs lower than 6 Å are connected. The experimentally determined reference structure (Fig.   8A) was added to this clustering to see where it would be mapped. Interestingly, the native structure was mapped to a small cluster, in which there are three models for the target sequence. The cluster medoid (Fig. 8B) achieved an RMSD of 7 Å to the reference structure. This clustering visualization showed that there were models generated with the correct fold, but none of them were selected as the final prediction. In the absence of the information about the reference structure, the default prediction of EvoClustRNA was the medoid of the biggest cluster (Fig. 8C).
An analogous analysis was performed for the results of clustering of EvoClustRNA|SimRNAweb runs for the TPP riboswitch. Models with a pairwise distance in terms of RMSDs lower than 9 Å are connected. Interestingly, the reference structure (Fig.   8D, dot) was mapped to a cluster of models of one of the homologs (Fig. 8F, blue). The medoid of this cluster (Fig. 8F) achieved a core RMSD of 9 Å to the native structure. This cluster was devoid of models for the target sequence and included only models of its homologs. Since SimRNAweb was not able to detect non-canonical interactions, most of the structures were in "open" conformation and were dissimilar to the reference structure.
The default prediction of EvoClustRNA (Fig. 8E) achieved an RMSD of 24 Å with respect to the reference structure.
We looked also at the diversity of models generated by the two methods used in this study. Figure 5. demonstrated that the top 100 models from SimRNA tend to be more similar to each other as compared to the top 100 models from Rosetta. The results of clustering for the TPP riboswitch are shown in Fig. S3. For this visualization, the top 100 models from each method were considered. The different diversity of models from each modeling method can be detected. The top 100 models generated with Rosetta were more diverse and sampled much bigger conformational space. In contrast, the top 100 models from SimRNA were similar to each other and sampled limited conformational space. This observation is important for further analysis when one combines models from different predictive methods to use them with EvoClustRNA.

Discussion
We present a computational workflow for processing RNA alignments to perform concurrent simulations with SimRNA and Rosetta that could improve RNA 3D structure prediction. We wanted to understand if by enriching a pool of models used for clustering with models obtained from folding sequences of homologs, we can influence the selection of the final model and thus improve RNA 3D structure prediction. To test this idea, the EvoClustRNA program was implemented. The workflow is free to use and can be downloaded from https://github.com/mmagnus/EvoClustRNA. Initially, the EvoClustRNA approach was tested on two blind RNA-Puzzles challenges. The We discovered several cases in which sequences of individual homologs were folded to much more accurate structures than the original target sequence. This demonstrated that RNA 3D structure prediction can be improved by the consideration of sequences homologous to the target sequence. However, many other homologs folded poorly and were not helpful. Further investigation is therefore needed to carefully pre-select sequences of homologs for a fully automated method of RNA 3D structure prediction.

This observation prompts investigations into a new direction of research for checking 3D
structure "foldability" of related RNA sequences. One potential solution would be to check if this "foldability" is related to free energy calculated by secondary structure prediction methods or to the potential of particular sequence variants to form stable structures and crystallize [21]- [23].
The workflow described in this study can be combined with any method for RNA tertiary structure prediction, and this is one of the possible lines of further research. As shown here, SimRNA and Rosetta achieved different prediction accuracy depending on the folded sequence, e.g., for the THF riboswitch (Fig. 5, "tha" sequence). Therefore, other RNA 3D structure prediction methods could be tested to see if they enrich the pool of accurate models used for clustering with EvoClustRNA.
The approach described here could be combined with direct-coupling analysis, proposed for example by [24] and [12]. In this approach, a DCA analysis should be performed for an alignment to generate restraints for several homologous sequences. These sequences could be then folded and EvoClustRNA could be applied to select the final model or to visualize possible folds of an RNA molecule.

Conclusions
We present a complete bioinformatics workflow for processing RNA alignments to perform concurrent simulations with different RNA 3D structure prediction methods, here exemplified by SimRNA and Rosetta. The workflow has proven useful for RNA modeling, as revealed be successful predictions for the RNA-Puzzle experiment [7]. At the current stage, the fully-automated method does not always provide a significant improvement over single sequence modeling. However, we discovered several striking cases in which particular homologs were folded to more accurate models than the original target sequence. This work, for the first time to our knowledge, demonstrates how important is the selection of the target sequence (from many variants in a multiple sequence alignment) for the success of RNA 3D structure prediction. This discovery prompted both

Reference structures
All structures solved experimentally and used in this study were obtained from the Protein Data Bank [25] and parsed to a standardized format with rna-tools (https://github.com/mmagnus/rna-tools).

Multiple sequence alignment generation and selection of homologs
Each query sequence was taken from the corresponding PDB file. The MSA was obtained from the Rfam database [32] and in case of the Pistol ribozyme, the MSA was published as the supplementary data provided by [19]. MSAs were reduced (using JalView [33], sequence similarity threshold 90%) to keep only diverse representatives. In theory, all sequences could be folded but because of the computational costs of simulations (6-10h per sequence for 80 CPUs, using either SimRNAweb or Rosetta FARFAR), we decided to fold only 4 the shortest sequences from the MSA. Once the final set of homologs to be folded was selected, the positions common to all sequences selected were determined.

RNA 3D structure prediction
For each sequence chosen for folding, secondary structure predictions were generated based on the MSA. Two methods were used in this study: SimRNA and Rosetta. For Rosetta, a total of 10,000 decoys were generated for the target sequence and each homologous sequence using the Rosetta FARFAR protocol [34]. For SimRNA prediction, SimRNAweb server was used [35]

The Rosetta method
The method used to generate and select models has been described previously [36], but will be reviewed in here briefly. Inspired by the Rosetta protein modeling tool [37] methodology, Fragment Assembly of RNA (FARNA) predicts the tertiary structure by assembling short 3-residue fragments, and then sampling using a Monte Carlo algorithm, guided by a knowledge-based energy function. The method was improved in 2010 by adding new energy terms within the force field specific for RNA molecules. The improved method was called Fragment Assembly of RNA with Full-Atom Refinement (FARFAR). This FARFAR protocol was used for modeling in this work. A total of 10,000 independent simulations are carried out (starting from different random number seeds) for each query sequence, and the resulting structures are clustered as previously reported [36]. For short RNA fragments (up to 32 nucleotides) Rosetta can be accessed via the "Rosetta Online Server That Includes Everyone" (ROSIE) [38]. However, in this work much longer sequences were modeled, so the Rosetta package was used locally at the HPC (High-Performance Computing) provided by the International Institute of Molecular and Cell Biology or, for the ZMP riboswitch RNA-Puzzle, on the Stanford BioX 3 cluster.
The SimRNA method (as implemented in the SimRNAweb server) SimRNAweb [35] is a user-friendly online interface for modeling RNA 3D structures using SimRNA [39]. SimRNA uses a coarse-grained representation of RNA molecules, the Monte Carlo method to sample the conformational space, and relies on a statistical potential to describe the interactions in the folding process. SimRNAweb makes SimRNA accessible to users who do not normally use high-performance computational facilities or are unfamiliar with using the command line tools. The simplest input consists of an RNA sequence to fold RNA de novo. Alternatively, a user can provide a 3D structure in the PDB format, for instance, a preliminary model built with some other technique, to jump-start the modeling close to the expected final outcome. The user can optionally provide secondary structure and distance restraints and can freeze a part of the starting 3D structure. The web server is available at http://genesilico.pl/SimRNAweb. In this work, all simulations were performed using the default parameters of the server. The lowest energy 100 and 200 models (called also in this work the top 100 and top 200) were generated based on SimRNA trajectories using rna-tools, i.e., the rna_simrnaweb_download_job.py script (https://rna-tools.readthedocs.io/en/latest/tools.html#simrnaweb).

Selection of common positions (conserved core)
Structural fragments corresponding to the evolutionarily conserved regions (common for all homologs) determined from the alignment are processed using evoClustRNA.py resulting in an all-vs-all core RMSD matrix. Next, the matrix is passed to the clustering script, evoClust_clustix.py to perform automated clustering in two modes: "1-of-6" and "half".

Clustering routine
EvoClustRNA uses the clustering procedure implemented earlier by Irina Tuszyńska for the analysis of RNA-protein complex models [40] and used in the NPDock server [41]. The method is an implementation of an algorithm used for clustering with Rosetta for protein structure prediction [42], also described in [14].
Briefly, a fraction of lowest-energy structures for each homolog is taken for clustering.
The clustering procedure is iterative and begins with calculating a list of neighbors for each structure. Two structures are considered as neighbors when the RMSD between them is smaller than a given distance cutoff. evoClust_clustix.py in the package is a program that performs a clustering for a user-defined cutoff, e.g., for RMSD equal to 7 Å. However, to find a proper cutoff, an iterative procedure of clustering starts from 0.5 Å and is incremented by 0.5 Å, until the required criterion is met. Two criteria were tested in this work, called "1-of-6" and "half." In the "1-of-6" mode, the clustering was stopped when the first (the biggest) cluster contained 1/6 of all structures taken for clustering. For example, for five homologs, 500 structures were clustered and an iterative clustering stopped when the first cluster contained over 80 structures. In the second mode tested, "half," the clustering procedure was finished when the first three clusters contained over half of the structures. Thus, for five homologs, 500 structures were clustered, and the iterative clustering stopped when there were at least 250 structures in the three biggest clusters.
This iterative procedure is implemented in evoClust_autoclustix.py that is a wrapper for evoClust_clustix.py.

Model selection
The final 3D model for the target sequence is the first occurrence of the model for the reference sequence in the clustering output starting from the top of the file. It there is no model for the references sequence in the first cluster, then the second cluster is processed, and so on. This analysis is done by evoClust_get_models.py automatically based on the output files generated by the clustering procedure.

Workflow implemented as EvoClustRNA
The scripts to perform the analysis are implemented in Python and freely available at Both metrics mentioned above, RMSD and INF, are used to calculate the distance between the generated models and reference structures. However, they cannot be applied directly to compare models for diverse homologous molecules that differ in sequence and length.
So to deal with this issue, a new metric based on RMSD was implemented as core RMSD.
Core RMSD considers only C3′ atoms of conserved cores (that are of the same size). The conserved cores determined based on input alignments are of the same sequence length, so there is always the same number of atoms to be compared. However, full atom RMSD for the cores cannot be calculated because the sequences can vary. That is why only a single atom, C3′, is used in this metric. Naturally, this metric is not only used for evaluation of the accuracy of predictions but also for clustering.
Calculations for evaluation of predictions are performed with evoClust_calc_rmsd.py program that is built around Biopython [44].

Structure visualizations
Structure visualizations in 3D were generated with PyMOL (version 1.7.4 Edu Enhanced for Mac OS X by Schrödinger) [45].

Statistical analyses
Statistical analyses and visualization of the data were carried out with Python 2.7 using mostly following Python packages: Matplotlib [46], Pandas, Seaborn [47], Jupyter (former IPython) [48]. The differences between benchmarked variants were tested with the Wilcoxon non-parametric statistical test implemented in SciPy. Competing interests the target sequence. "h234" means that models of three homologs were considered during clustering, the second homolog, third and fourth. For each variant 5 top clusters are shown and the first cluster is marked with a black dot. The first panel combines the results for SimRNA and Rosetta, the second panel shows the results for SimRNA and the third only for Rosetta.
Comparison of tested approaches in two modes of clustering: 1-of-6 and half.   Table 1 Additional file 6: All data required to generated RNA families tend to fold into the same 3D shape. Structures of the riboswitch cdi-AMP solved independently by three groups: for two different sequences obtained from Thermoanaerobacter pseudethanolicus (PDB id: 4QK8) and Thermovirga lienii (PDB id: 4QK9) [49] for a sequence from Thermoanaerobacter tengcongensis [50] and for a sequence from Bacillus subtilis (PDB id: 4W90) (the molecule in blue is a protein used to facilitate crystallization) [51]. There is some variation between structures in the peripheral parts, but the overall structure of the core is conserved.  The RNA-Puzzle 13 -the ZMP riboswitch. The superposition of the native structure (green) and the EvoClustRNA prediction (blue). The RMSD between structures is 5.5 Å, the prediction was ranked as the second in the total ranking of the RNA-Puzzles (according to the RMSD values).

Figure 4
The RNA-Puzzle 14 -L-glutamine riboswitch. The RMSD between the native structure (green) and the EvoClustRNA prediction (blue) is 5.5 Å.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.