Parallel computation of genome-scale RNA secondary structure to detect structural constraints on human genome

Background RNA secondary structure around splice sites is known to assist normal splicing by promoting spliceosome recognition. However, analyzing the structural properties of entire intronic regions or pre-mRNA sequences has been difficult hitherto, owing to serious experimental and computational limitations, such as low read coverage and numerical problems. Results Our novel software, “ParasoR”, is designed to run on a computer cluster and enables the exact computation of various structural features of long RNA sequences under the constraint of maximal base-pairing distance. ParasoR divides dynamic programming (DP) matrices into smaller pieces, such that each piece can be computed by a separate computer node without losing the connectivity information between the pieces. ParasoR directly computes the ratios of DP variables to avoid the reduction of numerical precision caused by the cancellation of a large number of Boltzmann factors. The structural preferences of mRNAs computed by ParasoR shows a high concordance with those determined by high-throughput sequencing analyses. Using ParasoR, we investigated the global structural preferences of transcribed regions in the human genome. A genome-wide folding simulation indicated that transcribed regions are significantly more structural than intergenic regions after removing repeat sequences and k-mer frequency bias. In particular, we observed a highly significant preference for base pairing over entire intronic regions as compared to their antisense sequences, as well as to intergenic regions. A comparison between pre-mRNAs and mRNAs showed that coding regions become more accessible after splicing, indicating constraints for translational efficiency. Such changes are correlated with gene expression levels, as well as GC content, and are enriched among genes associated with cytoskeleton and kinase functions. Conclusions We have shown that ParasoR is very useful for analyzing the structural properties of long RNA sequences such as mRNAs, pre-mRNAs, and long non-coding RNAs whose lengths can be more than a million bases in the human genome. In our analyses, transcribed regions including introns are indicated to be subject to various types of structural constraints that cannot be explained from simple sequence composition biases. ParasoR is freely available at https://github.com/carushi/ParasoR. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1067-9) contains supplementary material, which is available to authorized users.


CHAPTER 1 ParasoR algorithm
ParasoR is a novel software application intended for local RNA secondary structure prediction on genome-scale data. ParasoR is primarily based on the Rfold grammar [1], which is based on the conventional energy models subject to the constraint of maximal distance between base pairs. In this chapter, we describe the Rfold grammar and detail the ParasoR algorithm for constructing databases of local folding energy changes used for structure predictions.

Rfold grammar
In the Rfold model, the transition rules are expressed as follows.
Outer ! ✏|Outer · a|Outer · Stem Stem ! b < ·Stem · b > |b < ·StemEnd · b > StemEnd ! s n |s m · Stem · s n (m + n > 0)|Multi In these rules, ✏ represents the null terminal symbol; a, an unpaired nucleotide; s k , an unpaired subsequence whose length is k; and b <, b >, a base pair. Rfold enables an enumeration of possible structure patterns following these transition rules. In addition, Rfold sums up the Boltzmann factor of energies of the partial structure that belongs to each state based on the inside-outside algorithm so that Rfold calculates the expected value of certain partial structure or transition based on the Boltzmann ensemble.

X ⇢
MultiBif (i, k) · ↵ Multi2 (j, k) · t 0 (MultiBif ! Multi1 · Multi2) for j < k  (i + W + 1) The expected values in our study such as stem probability are estimated as the sum of the state transition probabilities p( , k, l ! 0 , k 0 , l 0 ). where and 0 represent nonterminal symbols of the Rfold grammar; k, l, k 0 , and l 0 represent sequence positions; ⇣, a secondary structure; ⌦( , k, l ! 0 , k 0 , l 0 ), the set of secondary structures containing the transition ( , k, l ! 0 , k 0 , l 0 ); dG(⇣, x), the free energy for sequence x with structure ⇣; R, the gas constant; T , the absolute temperature; Z, the partition function; (k, l), the outside variables; ↵ 0 (k 0 , l 0 ), the inside variables; t( , k, l ! 0 , k 0 , l 0 ), the Boltzmann factor for the transition; and ⌦ 0 , the set of all the secondary structures of sequence x. Summation of the state transition probability following this equation permits the computation of several measures of structure such as base pairing probability, accessibility, and structural profiles [2,3].
For long sequences such as mRNA and pre-mRNA, however, the increased number of possible structures makes it almost impossible to calculate Z and directly even though the increase of ↵ is restricted by the maximal span.

ParasoR algorithm
ParasoR is a parallel extension of the Rfold algorithm intended for the prediction of genome-scale sequences, and it is accomplished via three procedures: Divide, Connect, and Probability calculation procedure. One of the strongest features in ParasoR is its construction of databases of ↵ Outer and Outer in the form of their fold changes such as d↵, d , ↵, and ; each element is obtained by comparison with adjacent elements. The algorithm evaluates local increase rations of ↵ Outer and Outer in logarithmic scale, ↵(i) = log(↵ Outer (i)) and (i) = log( Outer (i)). Because a logarithmic transformation is applied to all internal variables of ParasoR in order to reduce numerical errors, we use ↵(i) and (i) hereafter to explain ParasoR algorithm.
When K computational nodes are available to calculate the secondary structure of a sequence of length N and maximal span W , ParasoR requires O(NW 2 /K) computational time in the Divide procedure, O(NW ) or O(KW 2 + NW 2 /K) in the Connect procedure, and O(NW 2 /K) in the Probability calculation procedure. In the Divide procedure, ParasoR calculates partial d↵ and d at each computational node independently. Then, in the Connect procedure, ParasoR establishes a database of ↵ and from the stored d↵ and d . ↵ and are independent of how the input sequences is divided into subsequences. Finally, several expected values such as base pairing probability are computed for any queried region using the ↵ and database. In the following sections, we describe how to calculate d↵, d , ↵, and . Then, we derive the formula of the "local partition function", which is the ratio of DP variables and partition function, as a function of ↵, , and other variables whose magnitudes are bounded independently of N . This formula allows us to calculate the expectation values without the direct computation of Z and whose magnitudes grow exponentially with N .

Divide procedure
To explain the methodology of the Divide procedure, we focus on the calculation of d↵ and d for the subsequence from the sth to eth positions after its assignment to a single node in the Divide procedure. First, we describe how to calculate d↵ using v, which is defined as follows to absorb all required energy during the transition between the Stem and Outer state.
Then, ↵ is simply expressed using v.
where ↵ i,h is a partial summation of ↵ and recursively obtained as below.
In Eq. 1.7, ↵ i,h is calculated from the data available in the same computational node while ↵(s h) depends on results of the previous node. To cancel out the contribution of ↵(s h), we consider d↵ i,h , or the local di↵erences of ↵ between two adjacent values.
In this way, ParasoR computes d↵ k,h and constructs a partial database of d↵ k,h using a single node in O(NW 2 /K) computational time.
In a similar way, d is also calculated by defining another v whose boundary condition is di↵erent from the previous v. 1 if e < j and j 6 = (e + h)

Connect procedure
The variables d↵ and d in the previous section contain the dependency on the fragmentation pattern. In the Connect procedure, ParasoR integrates these variables to calculate unique local fold changes ↵ and for each position as follows.
where ↵ and are computed by the sum of Boltzmann factors of the free energy of the secondary structures inferred within the subsequences of length W · Constant or less. At the same time, the calculation of these values indirectly includes the influences of long flanking regions (discussed in Chap. 3).
We implemented the Connect procedure in two di↵erent ways. In the first way, Para-soR stores the whole matrix of d↵ and d in the Divide procedure and sequentially . This temporal storage becomes, however, very large, up to around 10 TB for the human genome sequence in double (8-byte) precision. Accordingly, ParasoR also uses another implementation that constructs a part of d↵ and d to compute a vector of ↵ and of length only (W + 1) for the ends of the assigned sequence from the saved d↵ and d . Then, each node recalculates d↵ and d and completes the databases of ↵ and for the assigned region using the partial vector of ↵ and . A pseudo code for this algorithm is given (Algorithm 1). This procedure only requires the temporary . The time complexity of the first and fourth steps is O(NW 2 /K) for each node. The time complexity of the third step is O(KW 2 ) for a single node. The time complexity of the last step is O(NW/K) for each node.

Probability calculation procedure
For whole genome sequences, a partition function Z becomes so large that we cannot compute Z directly. However, Z is required in the form of /Z in the probability formula, and the increases of and Z are considered to balance each other because a large number of possible structures would increase both of them. In this section, we show that r = Z/ is evaluated only by local variables such as ↵ and , which are restricted to increase only up to the constant value depending on the maximal span rather than the whole sequence length.
We characterize the set of the "local" structures by locally maximal, non-overlapping substructures enclosed by a base pair. The pair of bases that encloses each local structure of subsequence x i,j between i and j is referred to as the outermost pair. For the explanation of ParasoR algorithm, we define the set of outermost pairs as is one of the canonical base pairs and 5  j i  W } and the set S(i 0 ) as {(i, j) 2 P | i  i 0 < j} S {(i 0 , i 0 + 1)} for any position i 0 . The sequence length of the local structures (|j i|) must not exceed W in ParasoR algorithm because base pairs between more distant bases are not allowed by the constraint of maximal span. Thus, all possible structures should consist of non-overlapping local structures and fragments of exterior loops between them. In Equation 1.3, the magnitudes of ↵ and t do not change with N since they are computed from the subsequence x k,l whose length does not exceed W . Z and are, however, the sums of Boltzmann factors for the subsequences of length O(N ) and grow exponentially with N . On the other hand, the probability p( , k, l ! 0 , k 0 , l 0 ) should be between 0.0 and 1.0, and so a large cancellation between (k, l) and Z must occur, which reduces the numerical precision. The cancellation is assured because the contributions from the structures far outside of (k, l) are almost the same. This can be seen from the following decompositions.
Here, i 0 can be set to any position and (k, l; i, j) are the outside variables for the subsequence between the outermost pair (i, j) satisfying the initial condition Stem (i, j; i, j) = t(Outer, i, j ! Outer·Stem, i, j). Equation 1.22 follows because the outside variables are the sum of the contributions of all possible patterns of outermost pairs. Equation 1.23 also follows because a base at any position i 0 (x i 0 +1 ) is either within the outermost pair (p, q) or is an exterior base.
Then we define a local partition function r(i, j) for the computation of expected values of local structures on subsequence x i+1,j as below.
In this chapter, we have already shown the DP algorithm that directly computes ↵ and without recourse to ↵ Outer and Outer . The inner variables u(p, q) can be computed without numerical di culties by using the ordinary inside algorithm.
In this manner, we can avoid the computation of variables that exponentially increase with N for r(i, j). Using r(i, j), a fold change (k, l)/Z is replaced by the outside variable for local structures and local partition function as below.
In the ParasoR software, it computes r(x, x) for the start position x first, then uses it to calculate r for the outermost pair (i, j) shown as below.
We should consider at most W 2 patterns of i and j for r computation and less than W + 1 additions of energies that depend on subsequences shorter than 2W . Also, an addition of ↵ and should be executed fewer than W times. Therefore, in this way, we can compute an expected value only by variables whose absolute value has an upper bound depending on W .
For the next position x + 1, the calculation of r(x + 1, x + 1) from r(x, x) takes O(1) time because the following equation holds.
Because both a number of these summations and the length of required subsequences are not directly dependent on length N , ParasoR is able to compute the expected values using only "local" variables whose magnitudes are independent of N . We also discuss the numerical precision of computed variables such as ↵, , and expected variables for actual data in Chapter 3.

CHAPTER 2
Manipulation of datasets and database construction

Dataset of transcript sequences
RefSeq annotation data mapped to the hg19 (human) and GRCm38 (mouse) genomes were used in the transcript database construction as a source for mRNA and pre-mRNA sequences. As a result, we obtained 45,377 pre-matured RNAs which include non-coding RNAs for human and 33,988 for mouse. They mapped to the same region on the genome sequence as many as 19 times. Prior to structure prediction, we concatenated each sequence and inserted W ambiguous characters "N" between each transcript to avoid interactions between neighboring RNAs.

Annotation rule
To examine disparities in estimates of structure preference among types of genomic regions, we annotated genome sequences according to several functional annotation categories, which were prioritized according to the strength of their sequence constraint. First, we tagged gene regions on chromosomes according to RefSeq data for both coding RNA (as CDS, 5 0 -UTR, 3 0 -UTR, and introns, in descending order of importance) and non-coding RNA (as exons and introns) ( Figure S1). These annotated groups contain repetitive elements in di↵erent ratios ( Figure S2) and these proportions of repeat sequence probably have a close relationship with the function of each category. Our regression, however, specifically cannot normalize repeat regions enough as a total e↵ect of each k-mer can be expressed by a simple summation. As such, we excluded repetitive regions from other annotation groups. We describe the important detail about the property of repetitive elements in Chapter 5. In addition, the antisense strand of the transcribed regions was removed from the intergenic regions and classified as "Antisense" group because of the possible strict constraints derived from the coding genes in the sense strand. Although the antisense sequence of a transcribed gene may be possibly mapped to another transcribed gene, the majority of such regions were classified into the antisense group ( Figure S3). The analysis of average stem probabilities for each 32-mer may have su↵ered from the influence of multiple annotations around the boundary regions. To remove such ambiguous e↵ects, fragments that overlap with multiple annotations were classified into "Multiple" annotation group.

Shu✏ing of sequence
To investigate the structural preferences of genome sequences, randomly shu✏ed sequences were used for comparison. The codon composition ratio is a conserved factor among transcripts. Thus, 3-mer shu✏ing using the ushu✏e module (http://digital.cs.usu.edu/ mjiang/ushu✏e/) was applied to produce random transcripts for each concatenated mRNA and pre-mRNA sequence. At the same time, we divided whole sequences into each block with ambiguous characters inserted as a boundary between each transcript or found in pre-mRNA as an ambiguous region. For genome analyses, chromosome sequences are too long for the memory usage requirements of ushu✏e. Accordingly, we randomly shu✏ed chromosome 1 for each individual sequence block in the resolution of single nucleotide, again divided by strings of the ambiguous character "N".

Sequence logo construction
To construct sequence logos in Figures S35, the RWebLogo package (http://cran.rproject.org/web/packages/RWebLogo/index.html) was applied to partial sequences around motif sites. For sequences around splice sites, however, the original sequence set was too large to apply RWebLogo. Hence, we implemented seqLoGo (https://github.com/ carushi/seqLoGo) in Go language, which allows compressing all sequences to a small number of sequences while still expressing the same information with an accuracy rate that depends on the number of sequences retained after compression.

CHAPTER 3
Validation of ParasoR for long RNA While a method of predicting RNA secondary structure has already been widely developed, to the best of our knowledge, there has not yet been a study that has examined structural properties of genome-scale data with various parameters. In this chapter, we examined the congruence of our results under a variety of conditions. Moreover, we analyzed properties of the energy model under the maximal span constraint.

Property of energy model and stem probability
The possible patterns of RNA secondary structure generally multiply in accordance with the sequence length N . A minimum free energy (MFE), which is one of the measures for evaluating the structure stability of RNA, also decreases with N . Although MFE can be normalized by dividing it by N to compare MFE between RNA sequences of di↵erent lengths, a previous investigation has indicated the risk of over normalization by division by N [4]. Because there is insu cient investigation of energy increase such as ↵, we computed the distributions of ↵ and using concatenated transcript sequences with the maximal span W = 200, 400, and1, 000. We randomly sampled one ↵ per 1,000 nt and Figure S4 shows that most of the ↵ distribution was within a range from 0 to 1.6 even though the variance estimates increase with W . Although the length of transcript sequences without ambiguous bases is short, the result of human chromosome 1 sequence was also consistent with that of transcripts sequences.
In addition, because each local structure contributes to the increase of ↵ and equally, the increase of is intrinsically comparable to ↵ based on existing energy models. In this way we showed that ↵ and calculated in ParasoR are small enough to calculate accurate stem probabilities while avoiding numerical errors even for genome sequences. On the other hand, because a partition function Z is calculated by summation of ↵, Z is expected to be too large value for large N since it can be estimated by exp( ↵ ⇥ N ), where ↵ represents an expected value of ↵.
The MFE is widely used to evaluate structure stability and extract structural preference of the sequence. However, this measure reflects only the best stabilized structure while ignoring numerous other structures [7]. Since the main purpose of ParasoR is the analysis of genome-scale RNA sequences, which decreases the quality of MFE estimates owing to an excessive number of possible structures. As such, we used the stem probability p stem (i) as a measure to evaluate local structural preferences. The stem probability is another typical index that expresses the tendency of structure preferences at a singlenucleotide resolution, and it has the advantage of being able to consider the stability of base pairing among an enormous variety of structures. To determine the general tendency of stem probability p stem (i), we generated 1,000 random sequences of 1,000 nt in length with a 50% GC content and averaged p stem (i) for each sequence without limiting the maximal span, in which p stem (i) is consistent with that of RNAfold (data not shown). In this analysis, p stem (i) was practically distributed around 0.6 with a notable fall at both ends of sequences, indicating that artificial sequence boundaries potentially cause such bias ( Figure S5).
We also calculated a conditional base pairing probability p Cbpp for each base pair to show that such a tendency is produced by the specificity of pairing with other sequence tips. Here, p Cbpp of the jth position among base pairs with the ith base is defined as follows.
where p Bpp (i, j) is the pairing probability between the ith and jth bases, and p Bpp (i) is a marginalized base pairing probability for the ith base , which is equal to p stem (i). Hence, p Cbpp is comparable to a likelihood of being partner with ith base. Figure S6 shows an example of p Cbpp profiles for the 461th and 11th bases. While a profile of p Cbpp for a nonterminal base indicates structural preferences for pairing within neighborhoods, such a profile for ends clearly shows that they are likely to form base pairs with their opposite ends. This result raises an issue about sliding-window methods because it can reduce the calculation time but generate many biases at the ends of each window although the global folding algorithm such as ParasoR may cause them only at the ends of sequences.
Next, we simulated a distribution of stem probabilities p stem and average stem probabilityp stem for genome sequence data. As shown in Figure S7(L), the distribution of p stem is bimodal at 0 and 1 and becomes nearly unimodal by window averaging after the structure prediction. In particular, the distribution ofp stem for each 32-mer or longer unit is almost unimodal. We also investigated the influence of di↵erent energy parameters on the distribution of p stem . In Figure S7(C), although both of models show a bimodal distribution with the peak around 0 and 1, the strength of bimodality is di↵erent for each energy model. In particular, the Turner (2004) model is shown to have produced more strongly bimodal results than did the Andronescu (2007) model. This is likely because their energy parameters were estimated from di↵erent knowledge of secondary structures and thus reflects properties of the datasets. The influence of di↵erent maximal span sizes was also tested for various window sizes using structure-prediction data from human chromosome 1. To evaluate an agreement between di↵erent conditions, a correlation coe cient was computed for several window sizes ranging from 10 to 400 nt. A comparison betweenp stem from Turner (2004) and Andronescu (2007) was higher than 0.8, at least for any selected window size ( Figure S8). Even accessibility, another typical index for structure evaluation, showed a high correlation withp stem . We also compared p stem with W = 200 and that of three other maximal spans: 100, 150, and 400. All of them show the correlation coe cient higher than 0.8 for the window size larger than 32-mer ( Figure S8).
For subsequent comparisons among di↵erent regions, a proper normalization according to sequence compositions is required. As such, we applied a linear regression method in which a unimodal distribution ofp stem is suitable. Although a larger window size produces stronger unimodality, structural preferences specific to each region were also averaged to the mean value. Based on these results, we selected the 32-mer window size in order to produce a concordant unimodal stem probability distribution without excess standardization.

Testing the accuracy of ParasoR by comparison with RNAplfold
To evaluate the robustness of ParasoR against overflow problems, we compared stem probabilities calculated with ParasoR and RNAplfold, which calculates the same values theoretically when the maximum window size and maximal span of base pairing are set to W . For comparison, we generated random sequences of 1,000, 3,000, and 5,000 nt in length, each with a 50% GC content. Figure S9 shows boxplots of stem probability di↵erences between ParasoR and RNAplfold (ViennaRNA package v2.0.7) for each sequence with varying the maximal span W . There is little di↵erence between these methods among 1,000 nt sequences. However, critical di↵erences appeared for sequences with lengths of 3,000 and 5,000 nt. This discrepancy is due to an overflow of RNAplfold that results in a computed probability exceeding 1, while ParasoR can still produce an appropriate distribution in the range from 0 to 1 for much longer sequences such as genome sequences ( Figure S10).

Influence of flanking region
We examined a way to take account of the structural influence of flanking sequences around target sites. To visualize the remodeling of the secondary structure by flanking sequences, the PTGFR gene (NM 001039585) was chosen as a subject sequence because it is su ciently long (> 49, 000 nt) and located at chr1:78,956,727-79,006,386, which is not the edge of a sequence block. Using RNAplfold, p stem (i) was calculated for a 200 nt center region of subsequences that contains a center region and varied lengths of flanking regions at both sides within the range at which no severe overflow problems occur in Sec. 3.2. As shown in Figure S11, di↵erences of two stem probabilities converge to 0 as longer flanking regions are added to the RNAplfold computation. This demonstrates that the computed stem probabilities converge when flanking sequences at both ends exceed 2W bases, which is consistent with previous findings in Ref. [1,3]. Thus, the stem probabilities computed for the entire chromosome roughly correspond to those for sliding windows with large (⇠ 4W ) margins. ParasoR simulated RNA secondary structure for the whole chromosome sequence while RNAplfold computed stem probability with a part of flanking sequence. The x-axis represents the length of flanking sequences around the region applied for the computation of RNAplfold. We also calculated the influence of flanking region from an opposite viewpoint by appending additional sequence in order to examine the change in stem probability. For comparison, p stem (i) was calculated for a sequence with N = 1, 000 by RNAplfold. Comparing this stem probability to that calculated by ParasoR for the same sequence revealed that they have little di↵erence from each other ( Figure S12). Then, we appended random flanking regions of N =100, 200, 1,000, 5,000, and 10,000 nt to both sides of the sequence, resulting in an increased disparity that eventually converged to a fixed di↵erence of stem probability. This also suggests that flanking regions have enough influence to change structural preferences only around the sites in the range of ⇠ 1000 bases.

Varying the precisions of floating point numbers
We investigated how the numerical precision of floating point variables a↵ects the Para-soR results. Figure S13 shows the di↵erences in p stem (i) computed for 1,000-nt random sequences with increasing the length of flanking sequences using di↵erent floating point variables. While the di↵erences in p stem (i) between 32-bit (float) and 128-bit (long double) variables increase rapidly, those between 64-bit (double) and 128-bit variables remain small, indicating using double or long double enables to safely analyze long RNAs without degradation of numerical precision.

CHAPTER 4 Consistency of thermodynamic prediction methods and experimental structure analyses
Recently, high-throughput structure analyses such as PARS have gained attention because of their wide application range and high productivity. However, there has been few comprehensive comparison between computational predictions and such analyses because of an inability to produce comparable amount of results via computational methods. In this section, we present a comparison of our prediction method and slidingwindow methods with PARS data for human transcripts. Moreover, we performed a validation of ParasoR prediction for cis-regulatory elements structures listed in Rfam database.

PARS score manipulation
We used normalized PARS score datasets GSM1226157 and GSM1226158 obtained from renatured samples of GM128678 [6]. PARS scores were calculated for 74,211 genes from v i and s i data, which indicate read coverage normalized by sequence depth of samples after RNase V1 and S1 degradations, respectively; the maximum scores of v i and s i are 216, 048 and 252, 848 respectively, and the average scores are 1.21 for v i and 1.20 for s i . To obtain sequence information of transcripts for ParasoR prediction, we extracted 33,603 genes that are included in PARS dataset and mappable to our RefSeq gene database. Among them, we excluded genes for which there were no nucleotides with s i or v i higher than 0. Then, we calculated the PARS score for each position following the equation PARS = log 2 (v i + 5) log 2 (s i + 5), which is defined in [6]. Moreover, we used raw v i and s i scores for further filtering of obscure information. In particular, if a threshold is x, every position with v i + s i less than x is excluded from the comparison.

Concordance between ParasoR predictions and PARS data
Next, we examined the concordance between PARS scores and two computational prediction measures; stem probability p stem of each single nucleotide and accessibility of each 10-base unit. Both of these measures exhibited a significant correlation with the PARS score and in particular the stem probability exhibited a higher correlation coe cient than did accessibility. The Pearson's product-moment correlation coe cients were 0.212 (p < 2.2e 16) for stem probability and 0.00238 (p < 0.019) for accessibility. This seems to be because accessibility is a measure of being accessible for a series of nucleotides at a certain time, despite positional independency of stem probabilities and PARS scores. We also investigated the correlation of the average stem probability for each 32-merp stem with PARS score, and their correlation coe cient is 0.208 and still high (p < 2.2e 16). Accordingly, we used p stem andp stem as measures in the subsequent analyses.
We also tested a prediction regarding long maximal spans to determine the influence of a fixed W . Figure S14 shows the distribution of correlation coe cients calculated between PARS scores andp stem for each 20-mer, which is a measure of consistency used in [6]. A distribution with W = 200 apparently contains much more regions of positive correlation between the PARS scores andp stem . When we increased W to 1, 000, there was little change to the distribution, though it also introduced a positive bias. According to this, di↵erences between ParasoR prediction and PARS scores are expected to not be caused by ignoring distant base pairs but other factors instead, such as complicated secondary structures, higher dimensional structures, and experimental biases.
As previously described, the distribution of PARS scores is highly concentrated around 0 and has a long tail at both sides. Because sequence reads are likely to be assigned sparsely, almost uncovered and totally obscure positions might contributed to structural tendencies. Accordingly, we examined the influence of such obscure regions by filtering the sequence data with a variable threshold t for the minimum read coverage. Figure S15 shows scatter plots of PARS scores and p stem when t = 0 and t = 40. Apparently, application of a strict t threshold removed data with PARS scores around 0 and elucidated a concentration of samples with positive PARS scores and ⇠ 0 p stem or negative PARS scores and ⇠ 1 p stem . Actually, the correlation coe cient between p stem and PARS score at each position increases with threshold even though the two variables di↵er in sample size ( Figure S16). Therefore, it is suggested that positions with rich information of sequence reads is likely to be consistent with ParasoR prediction.
Next, we drew receiver operating characteristic (ROC) curves of ParasoR, and two stochastic sliding-window methods, LocalFold and RNAplfold with PARS scores while varying thresholds to evaluate the accuracy of ParasoR in classifying regions as structured or accessible. To calculate p stem , three energy models were applied to this analysis, Turner (2004), Andronescu (2007), and Andronescu (2010) energy model. Among them, Andronescu (2010) model was applied to ParasoR only because RNAplfold, and eventually LocalFold cannot read a parameter file of Andronescu (2010) model.
Firstly, we obtained thresholds of evenly spaced PARS scores between the maximum and minimum values to define true structured and accessible regions. Then for p stem , we set thresholds from 0.0 to 1.0 to categorize each nucleotide as structured or accessible. The accuracy of the predictions was evaluated by setting true structured and accessible regions according to the PARS threshold with prediction of being structured or accessible according to the threshold of p stem .
Firstly, we fixed a threshold for PARS score to a middle value (-0.39). Table 4.1 shows an area under the ROC curve (AUC) which was obtained by varying a threshold for classification of stem probabilities. AUC values of ParasoR are slightly less than those of LocalFold and RNAplfold and the result is consistent for three energy models. Since setting a maximal span W to 1,000 decreases AUC compared to AUC with W = 200, 200 is considered to be a proper maximal span for structure prediction of transcripts. For stem probability computation, Andronescu (2010) model, which was optimized using the Boltzmann likelihood algorithm, has a highest accuracy for prediction, followed by Andoronescu (2007), and Turner (2004) model.
Comparing an impact of energy model and software, selecting a di↵erent energy model produced a larger change of AUC than selecting a di↵erent method.
Next, we confirmed an accuracy of each tool with di↵erent conditions such as thresholds for PARS score and filtering by read depth. Figure S17 shows AUC values about classification of three tools for whole or filtered dataset, and for a di↵erent threshold for PARS score. As a result, both filtering of low-read depth region with a liberal threshold for PARS score and no filtering with strict thresholds increased AUC for prediction. Strict thresholds for PARS scores increased AUC of each tool, indicating high accuracy. Although a liberal threshold decreases the AUCs, filtering out rarely mapped regions increases the AUC as well as the accuracy of strict thresholds. Such tendency was common for Turner (2004) and Andronescu (2007) model.
We also made similar comparisons among 5 0 -UTR, CDS, and 3 0 -UTR analyses. Figure S18 shows the distribution of PARS scores for each annotation class. In both of distributions, with or without filtering, 5 0 -UTR is found to have more structured (> 0) and less accessible (< 0) regions compared to other groups. In [6], however, 5 0 -UTR was estimated to be more accessible using an average of PARS scores even though this measure is considered strongly influenced by large absolute values. In contrast, our prediction method evaluates each position equally within a range from 0 to 1. As an absolute read coverage at each position is also determined by the number of mapped reads, it is supposedly an inappropriate measure to compare with p stem . Accordingly, we calculated a median PARS score of each annotation class for comparison. In Figure 3(C)) in the main text, the median PARS scores for CDS, 5 0 -UTR, and 3 0 -UTR regions were 0.6041, 0.6280, and 0.5083, respectively. These scores agree with the order of average p stem determined by ParasoR, which are 0.5956, 0.6248, and 0.58978, respectively, and furthermore, this order is robust to di↵erent filtering thresholds.
These analyses indicate that computational predictions and PARS analyses are highly congruent with each other in their ability to classify sequences as accessible or structured and in their measurement of regions based on functional annotation.

Concordance between ParasoR predictions and Rfam structures
To evaluate ParasoR prediction, we used CisReg dataset [17], which is compiled from the Rfam database and contains high-quality structures in the context of long sequences (detailed in Methods and Result section in the main text). For this dataset, Turner (2004) and Andronescu (2007) model were applied to ParasoR for computation of stem probability. In addition, we prepared dataset of genome and mRNA sequences whose length are longer than 3,000 nt (represented as genome long and mRNA long). Then, AUC was computed for the prediction of each tool, dataset, and energy model condition.
As a result, ParasoR with Turner (2004) model showed the highest accuracy compared to other tools or ParasoR with Andronescu (2010) model regardless of the minimum length of sequences in dataset. It is an opposite result to that of PARS data. We inferred it is due to the di↵erence of stem probability distributions among di↵erent tools. According to the distribution of stem probability ( Figure S20), ParasoR tends to predict a stem probability around 0 or 1 while LocalFold and RNAplfold produce gently sloping distributions. A bimodal distribution of ParasoR is similar to that of RNAfold ( Figure S7(Center)) and appropriate to predict rigid structures such as those in the CisReg dataset. On the other hand, the distributions of LocalFold and RNAplfold are supposedly more suitable for PARS data, which has a bell-shaped smooth distribution. In fact, we can increase concordance with the PARS data by using the Andronescu model and averaging stem probabilities to reduce the bimodality of the distribution (see the right bar graph of Figure 3(A) in the main text). Figure S7(Center) shows that Andronescu (2010) model instead of Turner (2004) model decreases the bimodality of stem probability distribution. Figure S7(Left) shows that averaging of stem probability also makes a distribution close to unimodal. These indicate the lower accuracy of Para-soR for PARS data as compared to RNAplfold and LocalFold mainly results from the di↵erence of their distributions.

CHAPTER 5
Genome-wide simulation by ParasoR

Linear regression of stem probability with sequence composition statistics
The stability of an RNA secondary structure is considerably a↵ected by sequence composition because the secondary structure is comprised of base pairs that di↵er in strength depending on base type. However, sequence compositions are constrained by multiple biological functions such as coding for proteins, regulating gene expression, and so on. The determination of structural preferences requires a normalization of the influence of sequence composition. GC content is a feature that is commonly used for proper structure evaluation. Moreover, previous computational and experimental analyses have shown an apparent correlation between GC content and evaluation index [4]. Additionally, genome sequences are known to have even more complicated sequence biases, for example, AT/GC asymmetry or codon bias. As such, we designed a normalization method for genome-wide comparisons of stem probability using GC content and other, more complex features. Using python 2.7 and the NumPy library, we implemented a linear regression using the average stem probabilityp stem (i) with a ridge penalty. A simple implementation is available at our website (https://sites.google.com/site/cawatchm/). We examined the linear regression of stem probabilities with GC content as well as k-mer composition frequencies with k = 3 and 4. We used a least squares method for estimation of a parameter vector w and a regularization term , and its error function is formulated as below.
In this formula, is a constant and w is a parameter vector in the same dimension as x. This equation is di↵erentiable at w, and we obtain w to minimize this error function.
For example, when we selected a 32-mer window size and GC content as a regression feature, we calculate the GC content and average stem probability for each 32-mer fragment. Then, we set x n and y n as follows. The first element of x n shifts the mean stem probability to around 0.6. Similarly, x n for the 3-mer composition regression is formulated as Through this normalization process, 32-mer fragments with more than window size/4 ambiguous characters "N" were excluded. For parameter optimization, we used stem probability data from both strands of human chromosomes and calculated the correlation coe cient between y n and w T x n . In this analysis, 4-mer composition exhibited a higher correlation coe cient with stem probabilities than did GC content ( Figure S21). Although other window sizes were tested for summarizing features, feature window sizes that were the same as those of average stem probabilities achieved the highest correlation coe cients. Based on this result, we used a 32-mer window size for feature calculation in the subsequent regressions of average stem probabilityp stem (i). Figure S22(L) shows the result of estimation with GC content and 4-mer regression for human chromosome 1. An alteration of is observed to have almost no influence on the correlation between y n and w T x n , likely because there is su cient data to avoid overfitting. Hence, we set to 0 in subsequent analyses. Figure S22(R) shows the highest correlation coe cients of stem probability with GC content, 3-mer composition, and 4-mer composition. Figure S23 shows scatter plots of these GC content and 4-mer composition regressions with the actual stem probability for each 32-mer fragmentq stem . Owing to its fewer parameters, GC content was outperformed by 4-mer composition in data normalization. Accordingly, we mainly discuss the result of stem probabilities normalized by 4-mer composition p stem (computed by (p stem regression ofp stem )). However, structure tendency determined by our analyses generally does not change between the GC and 4-mer normalization analyses. This normalization additionally reduces autocorrelation ( Figure S24). According to this analysis, normalized stem probability indicates structure tendency within a narrow range compared to raw stem probability, which is highly influenced by regional heterogeneity in sequence composition.

Calculation time and memory
We applied ParasoR to human chromosomes and surveyed its resource and time e ciency by measuring the computational time and memory used by the HGC super computer.

Human chromosome
Three hundred computational nodes were used in parallel to analyze the human chromosomes. Figure S25 shows the time and memory usage averaged for the plus and minus strand of each chromosome. For this genome-wide calculation, ParasoR used 0.92 days at most for each chromosome, and the computational time is consistent with its linear dependency on sequence length N . Maximum memory usage was approximately 4 GB, which is typically available in personal computers.
We measured the computational time and memory usage for analyzing a concatenated human genome sequence (⇠ 3.1G) with an HGC super computer. Table 5.1 shows the maximum computational time of each process in memory-saving mode for the human genome sequence. Using 3000 nodes, ParasoR requires at most 3.3 GB of memory for these processes. If we ideally run all processes in parallel, ParasoR can complete its probability calculation of the entire human genome in 18.2 hours.

Comparison of calculation time with other tools
For comprehensive mRNA analyses, we measured a running time of two of slidingwindow methods, LocalFold and RNAplfold, for the calculation of stem probability or accessibility. Our dataset contains totally 140,031,550 nt of mRNA sequences and we distributed mRNA sequences into 300 multiple FASTA files to contain the same number of bases as much as possible. We computed elapsed times of RNAplfold and LocalFold for these 300 files and obtained an average calculation time (Table 5.2). For the sequence of 140M nt, A calculation time of ParasoR is estimated for Divide, Connect, and Probability calculation procedure by the slope and intercept value obtained from the running time of human chromosomes ( Figure S25). This suggested that a running time of ParasoR is comparable to that of LocalFold. In addition, it is also possible to accelerate the structure calculation even more when we have already constructed database because we can skip the step of Divide and Connect procedure.

Random sequences for the simulation of pre-mRNA and intron computation
While ParasoR is the only application which is available for human chromosomes and genome sequence, it is just a simulation and such long RNA is never transcribed in real.
To clarify the utility of ParasoR parallelization for real transcripts, we applied ParasoR algorithm of single core and multiple core mode to the random sequences whose length is in the range of human pre-mRNA (from 1000 to 1,000,000 nt) and which have 50% GC content . Figure S26 shows an average elapsed time of ParasoR in the single core or multiple nodes of HGC super computer for 100 sequences. In ParasoR, the number of computer nodes is automatically reduced to an appropriate size depending on the balance of sequence length and maximal span. Thus, we showed the maximum number of available computing nodes in Figure S26. In addition, only a single sequence was applied to measure the running time of 100,000-nt and 1,000,000-nt sequence dataset due to the problem of long calculation time. We plotted a hundredfold running time for these conditions. As a result, the running time proportionally increased according to the sequence length N . The result of ParasoR running time is less than tenth of that of Rfold. Although ParasoR with multiple nodes possesses an overhead cost for distribution and requires an additional process such as database construction, this parallelization achieved a substantial acceleration of the structure prediction for longer sequences as the number of computer nodes increases. In our human pre-mRNA dataset, the longest intron exceeds 1,000,000 (1,055,451) and ParasoR also showed its e ciency for the sequences of such length. Therefore, we concluded that ParasoR algorithm is still e↵ective for pre-mRNA and full-length intron sequences.

Stem probability distribution for each genomic region
In order to understand the structural features of genome-sequences themselves, we simulated RNA secondary structures of chromosomes using ParasoR. Following the annotation rules outlined in Chapter 2, a distribution of both raw and normalized stem probabilities was generated for sequences across 10 types of annotation. In Figure S27(L), non-coding regions are observed to be more similar to introns than to CDS regions. This comparison was repeated for log ratios of densities log(f t (x)/f Intergenic (x)) (t = genomic region); x normalized average stem probability p stem (i)). Figure S27(L) shows that most of annotation categories in transcribed regions contain more structured fragments than do intergenic regions. Transcribed regions are thus more likely to be structured than are intergenic and repetitive regions. In addition, the distribution of these log ratios for CDS and repeat regions are notably di↵erent from those of the other annotation categories. In particular, repeat regions were unlikely to be su ciently normalized. Thus, repetitive regions can be incorrectly estimated to be less structured. Figure S27(R) shows a similar result for the transcribed regions, though it includes repetitive fragments. Although the distribution of raw stem probabilities is less influenced by this inclusion, negative normalized stem probabilities increased the log ratio of each annotation category. To determine the structural tendency of such regions, it would be necessary to make comparisons among regions with the same repeat sequences in di↵erent positions or annotation categories. We also normalized stem probability by using GC content ( Figure S28(L)), and the same tendency was obtained, though GC content did not normalize the unimodal distribution as well as 4-mer composition did.
When using stem probabilities of 1-mer-shu✏ed human chromosome 1 (detailed in Chapter 2), the random sequence was highly concentrated at 0.6 compared to intergenic regions (data not shown); on the other hand, a normalized stem probability distribution of shu✏ed sequence contains more structured regions ( Figure S28(R)). This is likely because repetitive, GC-rich regions contribute to increases of structure across regions. This suggests that each genomic region is less structured than completely randomized sequence because of various biological constraints.
We confirmed these genome-wide analyses of structure preference using the mouse genome. The raw and normalized stem probabilities agree with those observed across the human genome (data not shown). As such, structural features may be shared between even distantly related species.

Genomic features in structured and accessible regions after normalization
In previous analyses, structural preferences were detected for each annotation category and compared with intergenic regions. Then, we investigated the correlation between genomic sequence features and structural preferences regardless of genomic annotation.
To determine the cause of these structure tendencies, accessible and structured regions were defined as having normalized stem probabilities of more than 0.3 and less than -0.3, respectively. Then, we computed the entropy of 2-mer and 3-mer frequencies in accessible and structured 32-mer regions as follows. For comparison, the entropy of chromosome 1 was also calculated. Figure S29 shows an example entropy distribution computed for 2-mer and 3-mer fragments on human chromosome 1. As expected, accessible and structured regions are biased toward a low entropy in comparison with all of chromosome 1. In particular, accessible regions are likely to have lower entropy than structured regions are. The relationship between structure tendency and conservation was examined using PhastCons scores [5]. Figure S30 shows boxplots of normalized stem probability grouped by PhastCons score. Although there is a fluctuation of the median, no correlation was observed between conservation and structure tendency. In addition, the Pearson product-moment correlation coe cient was also not significant (p > 0.05).

CHAPTER 6
Structural preferences in mRNA and pre-mRNA In the previous chapter, specific structural tendency was observed in transcribed regions of genome sequences. In this section, structural features of transcribed region were surveyed in the forms of mRNA and pre-mRNA.

Structural preferences of human and mouse transcripts
First, a distribution of average stem probabilitiesp stem was calculated for mRNA and pre-mRNA in human and mouse transcriptomes ( Figure S31). Analysis of these datasets yields similar distributions as those of full genome sequences, but repeat regions in the mouse genome produce a small peak among extremely unstructured regions, and this may be produced by distinctive repetitive elements in the mouse genome such as SINE B1 and B2 elements. We also calculated log ratios of densities log(f t (x)/f Intergenic (x)) where f t (x) is the probability density of p stem (i) at x for each annotation group t on mRNA and pre-mRNA sequences in the same way as in the genome analyses. Figure S31 shows that all groups, except for CDS, 5 0 -UTR, and repeat regions, are likely to be more structured than intergenic regions. While CDS and repeat regions are similar to those of entire genome sequences, 5 0 -UTRs are more accessible than their expected distribution based on sequence composition. This is likely explained by boundary e↵ects such as the insertion of ambiguous characters or the phenomena in which the tips of sequences tend to be accessible (described in Chapter 3). This result is consistent across human and mouse transcriptomes.

Comparison of structural preferences with antisense and shu✏ed sequences
To evaluate structure tendency in transcribed regions, both antisense sequences and shu✏ed 3-mer sequences were used as a reference for comparison because they have the same GC content and sequence lengths. In the log ratio against intergenic regions, only antisense introns exhibit an opposite tendency compared with their sense strand counterparts ( Figure S31(C)). This is likely because intronic sequences have an AT/CG asymmetry, which forces antisense regions to have a tendency against structure. Next, we compared the distribution of sense sequences and these backgrounds using D statistics. To compare structural similarity, D statistics from the Kolmogorov-Smirnov test between sense strand and background distributions were evaluated using the ks2samp function in the SciPy library. Figure S32 shows D statistics of mRNA and pre-mRNA compared with the normalized stem probability distribution of antisense and shu✏ed 3-mer sequences. We used all of regions of shu✏ed sequences for comparison although repetitive regions and multi-annotated windows were excluded from sense and antisense sequences. Both intronic and exonic regions of transcripts significantly di↵ered from their antisense (p-values: exon mRNA, 9.66e-20; exon, pre-mRNA ⇠ 0.0; intron ⇠ 0.0) and shu✏ed 3-mer sequences (p-value: mRNA, ⇠ 0.0; pre-mRNA ⇠ 0.0). All of D statistics are positive and this means that exonic region is unlikely to be structured compared to antisense and shu✏ed 3-mer sequences.
To determine the trend in the structural tendency of transcribed regions, we utilized signed Z-scores in a Wilcoxon's rank sum test using the distribution derived from the intergenic regions. Positive Z-score statistics indicate an inclination toward structured or less accessible bases compared to intergenic regions, while negative Z-score statistics indicate the opposite. Sense introns were likely to form structures while antisense introns exhibited an oppositely signed Z-score and thus tendency (Figures 6, S33). After splicing, sense strands of CDS regions exhibited a decrease of structure tendency after splicing, presumably owing to conformational alteration. Both 5 0 -UTRs and 3 0 -UTRs of the sense strand had a structural preference that was discordant with the antisense strand even though the distribution of 5 0 -UTR stem probabilities was uniformly a↵ected by a boundary e↵ect, as mentioned before.

Stem probability profile around start codons, termination codons, and splice sites
Around splice sites (SSs), specific secondary structures have been reported to have an influence on splicing e ciency. Hence, we examined structural tendencies with respect to location specificity including around the starts and ends of translation and SSs. Around start and stop codons, our inferred profiles agree with experimental analyses and are highly influenced by GC content (Figures 4 and S35(L)). Our results also included 3-mer periodicity of structure and GC content, which produces the structural periodicity [8].
We also investigated profiles around SSs using pre-mRNAs with at least three SSs. Figure S34 shows the average stem probability for each position within 40 nt around SSs grouped by the first, second, and third SSs from the front and back (where duplication was allowed). The first SSs clearly tend to be structured, while weaker disparities in stem probabilities are observed around the last SSs. This preference corresponds to high GC contents associated with CpG islands in or around promoter regions ( Figure S35(C)). In the profile of average stem probability di↵erences, however, there is little disparity among SSs ( Figure S36).
Then, we computed the normalized average stem probability of 32-mer fragments p stem along the sense and antisense strands. After this, we computed Z-scores for the Wilcoxon's rank sum test of the distributions of each position surrounding these SSs with the intergenic distribution of such data. To calculate p stem , each pre-mRNA sequence was fragmented into 32-mers from the start or end of each SS. Then, p stem was determined according to its distance from the nearest SS. Figure 7 in the main text shows that intronic regions in the sense strand tends to have base pairs. On the other hand, similar antisense strand regions have the opposite tendency (although this tendency is not statistically significant for every position after Bonferroni correction). Accordingly, introns have a tendency toward being structured even though it is not clear whether this preference comes from selective pressure on secondary structure or not. A peak positive Z-score is found in the polypyrimidine tract region of the sense strand, where RNA has been suggested to be single stranded to specifically prevent binding of U2AF65 [9]. Therefore, the polypyrymidine tract domains are suggested to be surrounded by the regions where stem structures are preferred for their sequence compositions.
In this chapter, we directly compare stem probabilities before and after splicing to ensure that normalization accurately accounts for the nature of various sequence biases. To directly compare mRNA and pre-mRNA, the di↵erence in average stem probability for 32-mers was computed for the same exonic fragments of mRNA and pre-mRNA, which have the same base composition and thus are normalized via the k-mer regression in the same way 7.1 General tendency of conformational changes between mRNA and pre-mRNA To understand the general tendency of alterations to the secondary structures, we computed q stem (i) =p stem, mRNA (i) p stem, pre-mRNA (i) for each 32-mer of the mRNA sequences. The distribution of q stem (i) that is negatively biased, suggesting that secondary structures are more accessible in mRNA strands than pre-mRNA strands. In addition, q stem (i) tends to be larger around SSs than those of distant positions because of the disappearance of splicing influences on secondary structures ( Figure S37). By calculating a probability of crossing base pair around SSs, we showed that pre-mRNA tends to form base pairs crossing over both of SSs compared to mRNA ( Figure S38). Thus, it is considered that intronic regions around SSs are likely to form base pairing with exonic regions against the exonic regions which are concatenated after splicing. When we determined q stem (i) according to annotation group, q stem (i) was significantly more accessible for all annotation groups except 3 0 -UTRs. However, because the average distance from SSs di↵ers for each annotation type, we filtered out almost unchanged samples with lower di↵erences than a given threshold so that we could test structure di↵erences only for substantially influenced regions. For each threshold, a number of post-accessible fragments exceeded the post-structural fragments in general ( Figure S39(L)). Wilcoxon's signed rank test indicated that CDS and non-coding exon were significantly accessible at every threshold (Table 7.2, 7.3, Figure S39(R)). In contrast, significant influences were not detected in other groups at other thresholds. Therefore, although other annotation groups have a tendency to become more accessible after splicing, their changes are supposed to be not significantly drastic.

Cause of drastic conformational changes
We investigated the factors that exacerbate structural changes by focusing on each SS. To discover the causes of drastic conformational changes during splicing, we tested the correlation between structure arrangement and four quantitative features: (1) gene expression, (2) GC content around mRNA SSs, (3) GC content around pre-mRNA SSs, and (4) intron length. The gene expression data consisted of CAGE data generated by FAN-TOM5 [10], and it was used to investigate the correlation of gene expression with structural influences of splicing. Each transcription start site expression count was averaged among di↵erent samples and removed as a tissue-specific gene if log 10 (medianexpression)  0.5. Regarding other features, GC content around SSs was calculated for 200 nt windows at exon junctions as well as at 5 0 -and 3 0 -SSs.
As measures of conformational change, a median, maximum value, minimum value, and median of the absolute deviation were calculated for the distribution of the disparity in stem probability between mRNA and pre-mRNA q stem (i) = p stem, mRNA (i) p stem, pre-mRNA (i) for 200 nt around SSs. To remove the duplicate influence of nearby SSs, each position was exclusively assigned to the nearest SS. SSs with fewer than 100 assigned bases were removed from subsequent analyses.
Median q stem values were significantly correlated with both gene expression data and GC content of mRNA and pre-mRNA ( Table 1 in the main text). A negative correlation between gene expression and median q stem can be interpreted as a small number of regions becoming structured in highly expressed mRNAs. A median absolute deviation (MAD) of q stem was obtained by median(| q stem median( q stem )|) for each SS and used to indicate the magnitude of fluctuation. According to our analyses, the MAD of q stem only significantly correlates with mRNA GC content ( Table 1 in the main text). This suggests that the secondary structures of GC-rich mRNAs are likely to be less altered by the splicing process. In addition, the maximum and minimum q stem values are shown in 7.1. The maximum q stem is significantly correlated with all three of these features although the minimum q stem is significantly correlated with mRNA GC content only. These results suggest that splicing specifically regulates structured RNA through its e↵ects on translational e ciency.

Gene set enrichment analyses
To characterize the relationship between conformational arrangement and biological functions, gene set enrichment analyses were performed for genes that are conforma-tionally changed between mRNA and pre-mRNA. We computed a median q stem (i) for each SS and defined a q stem (u) for a given SS u. We produced subsets of postaccessible and post-structural regions by selecting the top 1% and 10% of SSs according to q stem (u). The top 1% of post-accessible and post-structural SSs were mapped to 230 and 233 RefSeq genes, respectively, and those of top 10% were mapped to 2,087 and 2,195 RefSeq genes, respectively. Then we checked the distributions of the maximum intron length, average intron length, and number of introns for all and selected genes, respectively. The distributions of the top 1% and 10% of genes were significantly di↵erent from that of all the genes (Table 7.4), but the distribution of the top 1% of genes may have di↵ered as a result of the small sample size.
Then, these genes were remapped to UniProt IDs (208 (post-accessible) and 204 (poststructural) IDs for top 1% genes, and 1,868 (post-accessible) and 2,000 (post-structural) IDs for the top 10%) for Gene Ontology (GO) enrichment analyses using Ontologizer [11]. Table 7.5 shows the GO terms that were detected as significantly enriched among influenced genes. There were no enriched GO terms among the top 1% of genes. However, 23 GO terms were enriched among the top 10% of post-accessible genes, though no GO terms were enriched among post-structural genes. For post-structural and postaccessible gene sets in the mouse transcriptome, the top 1% of post-accessible and post-structural genes exhibited 41 and 48 enriched GO terms, respectively. Among the top 10% of mouse genes, 6 and 10 GO terms were significantly enriched among the post-accessible and post-structural genes, respectively (Table 7.6). These GO terms overlapped with those identified in the human analyses-for example, organelle, catalytic activity, and phosphorus metabolic process. Among the top 1% of genes, however, there was no consensus between human and mouse genes in terms of GO term enrichment, and the distribution of intron features in human and mouse di↵ered among all genes, as mentioned before. Hence, we applied a threshold of 10% for selecting post-accessible and post-structural genes.
Moreover, functional clustering was performed on the post-accessible gene set with DAVID [12], which eventually established 338 clusters based on this 10% threshold. The keywords of the clusters with the highest expression analysis systematic explorer (EASE) score were as follows: cytoskeleton (16.19), kinase (11.90), centrosome (9.18), actinbinding (7.12), and ubiquitin conjugate (6.42). Since the ubiquitin conjugate cluster includes genes related to muscle function such as Titin and to synaptic function such as Synaptoagmin-associated genes, this group seems to be involved in the construction and di↵erentiation of muscle or metabolic pathways. We also applied DAVID to postaccessible and post-structural mouse genes. For post-accessible genes, the extracted functional clusters with the highest EASE score were as follows: cytoskeleton (14.60), ATP-binding (13.11), centrosome (7.33), cell division (5.72), and actin-binding (5.42). For post-structural genes, the extracted functional clusters with the highest EASE scores were as follows: ATP-binding (16.57), kinase (6.19), C2 domain (4.23), and cytoskeleton (4.17).

Di↵erence of structure propensity between sense and antisense sequence
We suggested the structure propensity of intronic regions di↵er from each other in sense and antisense strand in the main text. That tendency was shown only after the normalization by the k-mer composition. In this section, we thus examined the di↵erence of structure propensity between sense and antisense sequence about intact stem probability, partition function, and -centroid structure computed by energy models.
To extract the structure propensity bias of the energy model, we produced random sequences of 200-nt long with the specific GC or GU content, and then computed the partition function of them and their complementary sequences. As a result, the partition function of random sequences showed few deviation between sense and its antisense sequence, and the relationships of the partition function with varying GC content is almost linear (Figure S40(L)). On the other hand, the biased GU content produced the non-linear relationships between the partition function of sense and antisense sequences ( Figure S40(C)). This is supposed to be mainly caused by the disappearance of G-U base pairs in the complementary sequence. In addition, random sequences having the same GU content still showed the deviation of the partition function, indicating the existence of more complex biases in both of Turner and Andronescu energy model.
Next, the comparison of stem probability was applied to NEXN gene (NM 144573) ( Figure 9 in the main text). To examine the relationship of stem probability between sense and antisense strand, we calculated the di↵erence of stem probability (mRNA p stem -pre-mRNA p stem ) for each position of sense and reversed antisense sequences. Figure S41 shows the average di↵erence of stem probability and GU content for each 32-nt window on the mRNA and pre-mRNA sequence. The average di↵erence of stem probability is clearly proportional to the GU content of the window in the sense strand. Also, pre-mRNA is longer than mRNA so that it contains more biased positions in terms of the di↵erence of stem probability as well as GU content.
We also used NEXN gene to investigate the di↵erence of -centroid structures in sense and antisense strand. Splicing out of the 3rd intron sequence was observed to produce the largest di↵erence of stem probabilities on the peripheral exonic region among all of 12 introns in NEXN gene ( Figure S9). In Figure S42, the partial -centroid structures were extracted around the 3rd intron region of NEXN gene in sense and antisense strand (visualized using http://biojs.io/d/drawrnajs). -centroid structures were originally predicted for the whole sequences, but only the partial structures around the 3rd intron were extracted for visualization not to have any base pair which has a probability higher than 0.5 with the exterior sequences. Although the longest stem loop is roughly conserved, 2 of 3 stem loops in the right side of the sense structure disappeared from the multi-loop in the antisense structure. The length of stem loop in the left side also became shorter in the antisense structure, resulting the long multi-loop was newly formed in it.
Furthermore, the consensus sequence of human Alu repeat, shown as below, was extracted from RepBase [13] for -centroid structure prediction.
• Alu-Jo subfamily GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCACUUU GGGAGGCCGAGGCGGGAGGAUUGCUUGAGCCCAGGAGUUC GAGACCAGCCUGGGCAACAUAGCGAGACCCCGUCUCUACA AAAAAUACAAAAAUUAGCCGGGCGUGGUGGCGCGCGCCUG UAGUCCCAGCUACUCGGGAGGCUGAGGCAGGAGGAUCGCU UGAGCCCAGGAGUUCGAGGCUGCAGUGAGCUAUGAUCGCG CCACUGCACUCCAGCCUGGGCGACAGAGCGAGACCCUGUC UCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Figure S43 shows the -centroid structure ( = 1) of the Alu-Jo subfamily sequence in sense and antisense strand. Although the structure of the sense sequence shows two arms of stem loop, antisense one possesses long multi loop as well.
Therefore, it is concluded that the predicted structure and structure propensity of sense and antisense sequences possibly show the critical di↵erence depending on their sequence composition. We computed Bonferroni-corrected p-values by using Wilcoxon's rank sum test.  Procedures to compute p-values for some nonparametric tests Our experiments to detect structural preferences require computing p-values for very large datasets that cannot be handled by popular statistical analysis tools such as the "R" programming environment [14]. Here, we describe the procedures to compute pvalues for the analyses in which R could not be used.

Wilcoxon's rank sum test
We used Wilcoxon's rank sum test to estimate the structural preferences of each transcribed region relative to intergenic regions ( Figures 6 and 7 in the main text). Since the datasets were very large (⇠ 10 8 ), we kept only the histograms of stem probabilities, which necessitates the handling of tied values. Since it is impossible to compute the exact null distribution for large data sets with ties, we used the normal approximation, in which p-values are computed from the asymptotic normal distribution for a Z-score statistic of the test. Let X = {X i |i = 1, . . . , n X } and Y = {Y j |j = 1, . . . , n Y } be the two datasets to be compared, then the Z-score of the Wilcoxon's rank sum test is given where n = n X + n Y and R n is the rank function applied to the merged dataset X [ Y , which returns a rank 2 [1, n] for each value based on the increasing order of data values. If there are tied values, their average rank is assigned to every element within the tie group. G represents the number of tie groups and m g represents the number of elements in tie group g.
In the above formula, sign(u) represents the sign function, which is defined by and c(u) is the continuity correction term. Under the null hypothesis, the Z-score asymptotically follows the standard normal distribution.

Wilcoxon's signed rank test
Wilcoxon's signed rank test compares two datasets X and Y of paired observations {(X i , Y i )|i = 1, . . . , n}. It was used to detect systematic changes of structural strengths due to splicing ( Figure 8 in the main text). There are some ambiguities for the definition of the Z-score in the cases in which there are observations (X i , Y i ) such that X i = Y i . In the wilcox.test() function of R, such observations are simply ignored, and the Z-score is defined by

Conover's squared ranks test for testing equal variances between samples
Conover's squared ranks test is a nonparametric test for detecting di↵erences of variance among multiple samples [15]. We used the test to show that the distribution of normalized average stem probabilities p stem in the coding regions has significantly smaller variance than that of intergenic regions ( Figure 5 in the main text). Let where n = n X + n Y and R n is the rank function applied to the merged dataset {D Xi } [ {D Y j }. As in the previously described procedure, the average rank is assigned for tied values.

Computing p-values from Z-scores
For a given Z-score z, the p-value p(z) for the hypothesis test on the more significant side is computed by: where erfc(x) is the complementary error function. When the computation of p(z) based on this formula returned zero for large |z|, we used the exact upper bound of log 10 (p(z)) [16] log 10 (p(z)) < ✓ z 2 + ln (2⇡z 2 ) 2 ln (10) ◆ which can be derived from the continued fraction representation of erfc(x) x p ⇡ where only the first term in the continued fraction is taken into account on the right hand side.                                  -centroid structure of sense (Upper) and antisense (Lower) consensus sequence of the Alu-Jo subfamily. Black and red arrows correspond to the start and end point in sense strand, respectively. As in Figure S42, we reversed the antisense sequence and structure.