SSS-test: a novel test for detecting positive selection on RNA secondary structure

Background Long non-coding RNAs (lncRNAs) play an important role in regulating gene expression and are thus important for determining phenotypes. Most attempts to measure selection in lncRNAs have focused on the primary sequence. The majority of small RNAs and at least some parts of lncRNAs must fold into specific structures to perform their biological function. Comprehensive assessments of selection acting on RNAs therefore must also encompass structure. Selection pressures acting on the structure of non-coding genes can be detected within multiple sequence alignments. Approaches of this type, however, have so far focused on negative selection. Thus, a computational method for identifying ncRNAs under positive selection is needed. Results We introduce the SSS-test (test for Selection on Secondary Structure) to identify positive selection and thus adaptive evolution. Benchmarks with biological as well as synthetic controls yield coherent signals for both negative and positive selection, demonstrating the functionality of the test. A survey of a lncRNA collection comprising 15,443 families resulted in 110 candidates that appear to be under positive selection in human. In 26 lncRNAs that have been associated with psychiatric disorders we identified local structures that have signs of positive selection in the human lineage. Conclusions It is feasible to assay positive selection acting on RNA secondary structures on a genome-wide scale. The detection of human-specific positive selection in lncRNAs associated with cognitive disorder provides a set of candidate genes for further experimental testing and may provide insights into the evolution of cognitive abilities in humans. Availability The SSS-test and related software is available at: https://github.com/waltercostamb/SSS-test. The databases used in this work are available at: http://www.bioinf.uni-leipzig.de/Software/SSS-test/. Electronic supplementary material The online version of this article (10.1186/s12859-019-2711-y) contains supplementary material, which is available to authorized users.

. To best represent the probability ensemble visually, secondary structures can be used. The centroid structure is preferable rather than the most common used minimum free energy one, since the centroid represents the structure that is closest to all other possibilities. snoRNA HACA76 (Fig. 1 left) is a uniform family with a possible candidate for positive selection. This family has a divergence score of d = 0.1 and selection scores indicating negative selection (s = 0.0) for all species, with the exception of Human, with a score that could indicate positive selection (s = 15.8). The visual profile of the structures of this family is clearly uniform, with only one structure (Human) standing out ( Fig. 1 left). As a comparison, local block five of lncRNA blastnMacaque.Locus 222061 ( Fig. 1 right) is a non-uniform family, with a family divergence score of d = 57.9. Except for Orangutan, this block has intermediate and high selection scores for all species (s > 5.0). Importantly, in this particular case high selection scores unlikely point towards positive selection, rather than diverse functionality among the species, or as an alternative hypothesis, even incorrect orthology. is an example of a non-uniform family, with no clear trend of structure. The branching pattern was constructed according to the species phylogenetic tree, with the species structure distance score on the branches. Species distance score is defined by the structural distance between species and its consensus, normalized by the alignment length. Pan (Panel B) represents both chimpanzee and bonobo species, as described in (Necsulea et al., 2014).

Choice of an appropriate threshold for filtering non-uniform families in primate databases
Filtering out structurally non-uniform families is a reasonable first step for screening databases to look for positively selected structures. For that the family divergence score d of the SSS-test can be used, as discussed in the previous section. To choose an appropriate threshold for the primate databases used on this work, we visually inspected 12 families of ncRNAs (Table 1), with different divergence scores, ranging from d = 0.0 to d = 65.0. Importantly, the choice of threshold may vary from project to project. The profiles observed in this work came from primates, which are phylogenetically very close. For different projects, with more distant or closer species, the threshold may be adapted to best fit the data. In addition, the candidates should be subjected to functional testing for confirmation of the predictions.  Table 2).
The centroid structures of all species belonging to the family were taken into account along with the s scores of the human structure (Fig. 3). All considered families are low-divergent.

Alternative Assessment of Positive Selection
Using the basic idea of the Ka/Ks-test, we classified single nucleotide changes into structurally conserved (sc) and structurally disruptive (sd). As in the SSStest, we used RNAsnp p-values to discriminate between the types of sites. We The power of this approach however seems to be quite limited. First, it requires a reasonable number of changes for calculation purposes. This works well for HAR1, since it is the region in the entire human genome that accumulated the most human-specific changes (c = 13 according to our experiments, also accounting for compensatory bases). In contrast to that, the number of changes per lncRNA is usually rather small. In our analysed local blocks, the mean number of human-specific changes isx = 0.74, with a variance of σ 2 = 9.
In addition to that we found many false positive signals when analysing families by visual inspection.
We tested whether better estimates could be obtained by using a Poisson distribution for the expected number of substitutions parametrized by the expected change rate for a family, which is more appropriate when substitutions are sparse. This indeed improves the robustness, but still relies on the correctness of the classification of the sites, which in itself is not precise enough. This approach still led to many false positives. Hence, we abandoned this idea in favor of using the evidence provided by RNAsnp as a quantitative rather than a categorical variable.
Using the Ksd/Ksc approach, we investigated a subset of the 15,443 families provided by Necsulea et al. (2014). This subset was composed of lncRNAs enriched in human-specific changes. 76 families showed in the primate alignment strong conservation among non-human primates and higher sequence divergence in humans (a similar situation to the HAR1). Applying the Ksd/Ksc approach to this collection resulted in 543 conserved local structures. Of these, 71 had sufficient orthologous substructures and 7 showed signs of positive selection in humans. For this analysis, no filter was applied in regards to family divergence.
From the positive selection candidates, two were also detected in the current approach of the test and five were not considered because of their family divergence, which is higher than the set threshold we applied in the new approach.
Interestingly, the top scoring structure, which is part of the lncRNA H19X, received a score for the Ksd/Ksc approach of s(H19Xsub2) = 2.79 and, like HAR1, has a significantly more stable secondary structure in human (Fig. 4).
Interestingly, this same structure received a positive selection score with the SSS-test (s(H19Xsub2) = 29.4) and low score for the other species (s ≤ 1.2).
Although the family did not pass the divergence threshold (d = 22.1), we still suggest this structure as worthwhile to investigate, especially considering that the threshold serves for guiding purposes, and this family seems uniform enough.
Another one of the local structures found to be under positive selection in humans by the Ksd/Ksc approach was fundamentally different in humans, when compared to the other species. Visually, it did not seem to belong to the same group as the other structures. This brought the concern whether it was correctly annotated. To re-assess the orthology with a more robust approach, we used the Infernal suite (Nawrocki et al., 2009), by first building a covariance model for Figure 4: Local lncRNA structure predictions for H19Xsub2: (left to right): human, gorilla, orangutan and pan. The human structure received a positive selection signal with the Ksd/Ksc approach, as well as the approach of the SSStest. The colours of the bases are assigned according to their pairing frequency in the structure's ensemble. Shades of red occur in ≥ 90% of the ensemble, shades of green/yellow denote increasing probabilities from ≥ 50%. For unpaired bases, shades of red denote increasing unpairedness.
the non-human primate structures, then calibrating it and finally searching it in the entire human genome. Interestingly, the single hit maps to an intronic part of the orthologous transcripts, suggesting that the human lncRNA may have lost one exon (Fig. 5). The annotated gene structure is drastically different in the different primates. While human and macaque have only two exons, chimpanzee/bonobo and gorilla have five, and orangutan features six introns, suggesting a rapid turnover of gene structure in this lncRNA gene. The initially classified human CUFF.464429sub4, depicted in yellow, initially received a positive selection signal with the Ksd/Ksc approach. This structure was re-classified to a new orthologous group and the revised human orhologous of CUFF.464429sub4, depicted in purple, was found by using the non-human primate structures as a search model. More details in the supplemental text. Coordinates refer to assemblies of ENSEMBL v62 as described in (Necsulea et al., 2014).

Indel impact on structure
One of the challenges of detecting positive selection on ncRNA structures is to account for the structural impact of indels. To assess indel impact on ncRNA structures, we build a framework to simulate gap evolution (Fig. 6). Given an input ncRNA (the ancestral structure), it evolves the sequence inputing gaps of size n (defined by the user) in a window-based manner, from the first base to the last. It assesses and outputs the structural impact of the gap in all positions using the RNAforester tool (Höchsmann, 2005). Importantly, RNA structure comparison cannot be treated like the problem of sequence comparison, for instance. In sequence comparison, there is only one layer of complexity, the order of the elements, or bases, while in structure comparison, there is another layer that must be accounted for, which is the secondary dimension of the structure. RNAforester implements structural tree alignments, pointing out their dissimilarities, which is convenient for calculating indel impact on a structure. In addition, structures of different lengths can also be compared with this tool.
Using this idea and applying rank statistics, we accounted for the structural impact of observed indels in the SSS-test (more information on the main text).
We also applied the gap analysis framework (Fig. 6) on biological RNAs (one example in Fig. 7), to check if the length of the gap was important for the structural impact. The sequence of the tRNA used in the example is: >tRNA UUUGGGUGUAUAGCUCAGUUGGUAGAGCAUUGGGCUUUUAACCUAAUGGUCGCAGGUUCA AGUCCUGCUAUACCCACCA We performed the same experiment in 12 other biological RNAs and noticed that the length of the gap did not matter for the impact, but rather its location.

Secondary structure
Structral impact of inserted gap Gap size 10 2 5 Figure 7: Structural impact of inserted gaps in a tRNA structure. x axis: tRNA base positions from 1 to 78, y axis: structural impact of the inserted gap at position x. Blue line: gap of size 2, orange line: gap of size 5, red line: gap of size 10. As a reference, the dot-bracket structure of the tRNA MFE structure is below the sequence. Note that the impact is zero when the gap is located in an unpaired region (denoted by dots), and it is higher whenever the gap overlaps a paired region. This behaviour is independent of the gap size.

SSS-test implementation and usage
The SSS-test was implemented as a bash script SSS.sh that calls separate perl scripts executing specific modules. These were used for all experiments cited in the main paper and can be downloaded along with a README tutorial file at: http://www.bioinf.uni-leipzig.de/Software/SSS-test/.
The SSS.sh can be run in the command line as:

SSS.sh -i <FASTA_FILE> -f <FILE_FORMAT> -s <Yes/No>
with -i indicating input file, -f format (fasta or aligned) and -s if the user wants the secondary structures to be saved or not in a subfolder.
The following tools are used internally, and must be installed beforehand: • RNAsnp (Sabarinathan et al., 2013) • muscle (Edgar, 2004) • Vienna RNA package (Lorenz et al., 2011) • Bio::AlignIO http://search.cpan.org/dist/BioPerl/Bio/AlignIO.pm     Figure 11: Orthologs of the lncRNA SIX3-AS1 in five primates with a conserved local structure. Local structure sub11, denoted in purple, show signs of positive selection in humans and negative selection in the other species. This local structure is present in all five primates, being located in the exons of human, orangutan and macaque, in the spliced transcript of pan, between two exons and outside of the reported locus in gorilla, in a region with no other annotated elements. Introns are depicted as light grey rectangles and exons as dark grey arrows. Coordinates are given in accordance to (Necsulea et al., 2014) for the assemblies of ENSEMBL v62.  (a) Human Figure 14: Comparison of MIATsub31 MFE structures. MIATsub31 local structure is highly conserved among non-human primates, and different in humans, with a longer more stable structure, and a signal for positive selection.
(a) Human Figure 15: Comparison of MIATsub31 centroid structures. MIATsub31 local structure is highly conserved among non-human primates, and different in humans, with a longer more stable structure, and a signal for positive selection.  Figure 16: Percentage of sites that reach the threshold to be considered wellconserved, per different thresholds of what consititutes a well-conserved site from 1 to 100%. The default of the SSS-test is a 60% threshold, which leads to 98.6% of the sites being well-conserved.