Comparison with ARTS and LaJolla similarity scores
The Gauss-integrals based algorithm was benchmarked against the ARTS method by using all RNA structures that were suitable for ARTS, which does not process single chain molecules that lack base pairs. The correlation between the Gauss-integrals based distances and the ARTS similarity scores was analyzed. Figure 4A shows the relationship between the ARTS scores and the Gauss-integrals Euclidean distances. This relationship, determined on the basis of 820,600 pairs of RNA chains, can be fitted by the exponential curve
The FRASS algorithm was also benchmarked against the LaJolla method, using the dataset of 101 tRNA chains reported in [17], which yields 5,050 unique pairs of RNA structures. Figure 4B shows the relationship between the similarity scores (named TM) produced by LaJolla and the Gauss-integrals Euclidean distances, which can also be fitted by an exponential curve
Pearson correlation coefficients are equal to -0.98 in both cases. The correlations are negative since the Gauss-integrals based scores are distances while the ARTS and LaJolla scores are similarities.
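The two quantities discussed above, a Pearson correlation coefficient and an exponential fit, can be illustrated with a minimal sketch. The fitting procedure used for Figure 4 is not stated here, so the log-linear least-squares fit below is an assumption, and the data are synthetic values invented for illustration, not benchmark results. The function names `pearson` and `fit_exponential` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on log(y); requires every y > 0.

    Assumption: this log-linear fit is one common choice, not necessarily
    the procedure used for the curves in Figure 4.
    """
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    ml = sum(logs) / n
    b = (sum((x - mx) * (l - ml) for x, l in zip(xs, logs))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(ml - b * mx)
    return a, b

# Synthetic, illustrative data only: similarity decays with distance.
distances = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
similarities = [100.0 * math.exp(-0.8 * d) for d in distances]

a, b = fit_exponential(distances, similarities)
r = pearson(distances, similarities)
print(f"a = {a:.3f}, b = {b:.3f}, Pearson r = {r:.3f}")
```

A negative `b` and a negative `r` both reflect the same fact: a larger distance goes with a smaller similarity score.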
Benchmarking against the DARTS and SCOR classifications
An effective way to test the classification ability of a method is to compute a ROC curve. In the present study, the DARTS and SCOR classifications of RNA structures were used as external benchmarks. The DARTS database contains 1,333 experimentally determined RNA structures, which are classified into 94 clusters. Only the 789 single chain entries of DARTS were retained, since the web server described in the present manuscript handles only monomeric molecules. Gauss-integrals based distances were thus computed for 310,866 pairs of RNA structures. The NR95-SCOR dataset available at [26] contains 60 RNA chains that have more than 20 and fewer than 300 nucleotides and are assigned to SCOR functional classes with SARA. This results in 1,770 unique pairs of RNA structures.
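The pair counts quoted above follow from the number of unordered pairs of n structures, n(n - 1)/2. A short check, in which the helper name `unique_pairs` is hypothetical:

```python
from math import comb

def unique_pairs(n_structures: int) -> int:
    """Number of unordered pairs of distinct structures, n * (n - 1) / 2."""
    return comb(n_structures, 2)

for name, n in [("DARTS single chains", 789), ("NR95-SCOR", 60)]:
    print(f"{name}: {n} structures -> {unique_pairs(n):,} pairs")
```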
The ROC curve is obtained by plotting Sensitivity against (1 - Specificity), defined as
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$
where true positives (TP) and false positives (FP) are the numbers of pairs correctly and incorrectly predicted to belong to the same DARTS/SCOR cluster, while true negatives (TN) and false negatives (FN) are the numbers of pairs correctly and incorrectly predicted to belong to different clusters. Different points on the ROC curve are obtained by varying the Gauss-integrals based distance threshold below which two structures are considered to be similar and to belong to the same DARTS/SCOR cluster.
Figure 5 shows the ROC curves obtained as described above. The areas under the ROC curves (AUC) are 0.75 and 0.82 for the DARTS and SCOR classifications, respectively. These values measure the performance of the method. A value of 0.50 would be associated with a random similarity measure, while a value of 1.00 would be obtained with a perfect similarity measure. The AUC values obtained in the present study compare well with those obtained with the DIAL [13], SARSA [15], and SARA [16] methods, which range from 0.58 to 0.86 depending on which benchmarking set is used and on the fine tuning of each method.
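The threshold sweep and the AUC can be sketched as follows. This is a minimal illustration, not the code used to produce Figure 5: it takes Sensitivity = TP / (TP + FN) and 1 - Specificity = FP / (FP + TN), assumes a pair is predicted "same cluster" when its distance is less than or equal to the threshold, and integrates with the trapezoidal rule. The function names and the toy data are hypothetical.

```python
def roc_points(distances, same_cluster):
    """Return (1 - specificity, sensitivity) points from a distance-threshold sweep.

    distances[i] is the distance of pair i; same_cluster[i] is True when the
    benchmark places both structures of pair i in the same cluster.
    Assumes at least one same-cluster pair and one different-cluster pair.
    """
    positives = sum(1 for s in same_cluster if s)
    negatives = len(same_cluster) - positives
    points = [(0.0, 0.0)]  # threshold below every distance: nothing predicted similar
    for threshold in sorted(set(distances)):
        tp = sum(1 for d, s in zip(distances, same_cluster) if d <= threshold and s)
        fp = sum(1 for d, s in zip(distances, same_cluster) if d <= threshold and not s)
        sensitivity = tp / positives        # TP / (TP + FN)
        one_minus_spec = fp / negatives     # FP / (FP + TN)
        points.append((one_minus_spec, sensitivity))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy data, invented for illustration: six pairs, three in the same cluster.
toy_distances = [0.2, 0.4, 0.5, 0.9, 1.1, 1.5]
toy_same = [True, True, False, True, False, False]
toy_auc = auc(roc_points(toy_distances, toy_same))
print(f"toy AUC = {toy_auc:.3f}")
```

For a benchmark with several hundred thousand pairs, sorting the pairs once by distance and accumulating TP and FP in a single pass would be preferable to the quadratic loop above, which is kept for readability.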
Computational time
The computation of the Gauss integrals is O(n^3) in time, and it was observed that for long molecules (~1,500 nt) it takes about half an hour on a standard PC. Therefore, when dealing with a large database, the Gauss integrals must be pre-computed and stored on a hard disk. By contrast, computing Euclidean distances between pre-calculated Gauss integrals is extremely fast, and scanning the database takes only a few seconds on a standard PC.
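The pre-compute-then-scan strategy can be sketched as follows. This is an illustration under stated assumptions, not the FRASS implementation: the descriptor vectors are short made-up lists standing in for pre-computed Gauss integrals, the JSON file layout is an arbitrary choice, and all function and entry names are hypothetical.

```python
import json
import math
import os
import tempfile

def save_descriptors(path, descriptors):
    """Store pre-computed descriptor vectors (name -> list of floats) on disk."""
    with open(path, "w") as fh:
        json.dump(descriptors, fh)

def load_descriptors(path):
    """Load descriptor vectors previously written by save_descriptors."""
    with open(path) as fh:
        return json.load(fh)

def scan_database(query, descriptors, top=3):
    """Return the `top` closest entries as (Euclidean distance, name) tuples."""
    hits = [(math.dist(query, vec), name) for name, vec in descriptors.items()]
    return sorted(hits)[:top]

# Toy descriptors, invented for illustration (real vectors would be the
# Gauss integrals computed once per chain).
toy = {
    "rna_A": [0.0, 1.0, 2.0],
    "rna_B": [0.1, 1.1, 2.1],
    "rna_C": [5.0, 5.0, 5.0],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "descriptors.json")
    save_descriptors(path, toy)        # expensive step, done once
    database = load_descriptors(path)  # cheap step, done per query

hits = scan_database([0.0, 1.0, 2.0], database)
for distance, name in hits:
    print(f"{name}: {distance:.3f}")
```

The scan costs one distance evaluation per database entry, independent of chain length, which is why only the descriptor computation needs to be done ahead of time.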
Although FRASS does not produce structural alignments, it must be observed that, in general, methods that generate structural alignments are not suited to large databases containing long RNA chains: ARTS is O(n^3) in time, DIAL is O(n^2) in time, and LaJolla and SARSA are O(n) in time, but nothing can be pre-computed and stored on a hard disk for later use. The SARSA and LaJolla methods, which transform 3D structures into 1D strings, are faster than other techniques. In particular, SARSA was shown to be faster than DIAL, though no quantitative information was published [15]. LaJolla takes about 15 minutes on a standard PC to generate 5,050 alignments of RNA chains (see the datasets described in the paragraph "Comparison with ARTS and LaJolla similarity scores"). By contrast, the computation of 5,050 Gauss-integrals based distances takes only about one second. Moreover, structural alignments and function assignments with the SARA server are limited to RNA structures with fewer than 1,000 nucleotides, since the computations are very demanding.
Global similarity of 23S ribosomal RNAs
The large 23S ribosomal RNA from Haloarcula marismortui, the crystal structure of which was refined at 2.4 Å resolution [27], was chosen to test the web-server. As a query, we selected chain 0, containing about 2,700 nucleotides, taken from the 1FFK file of the Protein Data Bank. The most similar structure found in the database using the FRASS web-server is the 23S ribosomal RNA from Deinococcus radiodurans (PDB identification code 3CF5, chain X, about 2,700 nucleotides) [28]. The Gauss-integrals distance between the two structures is 1.2, which reveals their high structural similarity, because 96% of the distances computed for all pairs of RNA structures in the database are larger than 1.2. The ARTS similarity score, equal to 3,294.00, corresponds to 588 aligned base pairs and 2,118 aligned nucleotides. The high global similarity detected by both methods supports the similar biological activity of the two molecules, which was also analyzed in recent, detailed comparisons of their structures and functions [28, 29].
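The statement that a given fraction of all database distances exceeds the observed one is an empirical tail fraction. A minimal sketch, with a hypothetical function name and made-up distances rather than the FRASS database values:

```python
def fraction_larger(all_distances, observed):
    """Fraction of database pair distances strictly larger than `observed`."""
    return sum(1 for d in all_distances if d > observed) / len(all_distances)

# Toy distances, invented for illustration only.
toy_pair_distances = [0.8, 1.5, 2.0, 3.1, 4.4]
share = fraction_larger(toy_pair_distances, 1.2)
print(f"{share:.0%} of the toy distances are larger than 1.2")
```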