reactIDR: evaluation of the statistical reproducibility of high-throughput structural analyses towards a robust RNA structure prediction

Background Recently, next-generation sequencing techniques have been applied for the detection of RNA secondary structures, which is referred to as high-throughput RNA structural (HTS) analyses, and many different protocols have been used to detect comprehensive RNA structures at single-nucleotide resolution. However, the existing computational analyses heavily depend on the experimental methodology to generate data, which results in difficulties associated with statistically sound comparisons or combining the results obtained using different HTS methods. Results Here, we introduced a statistical framework, reactIDR, which can be applied to the experimental data obtained using multiple HTS methodologies. Using this approach, nucleotides are classified into three structural categories, loop, stem/background, and unmapped. reactIDR uses the irreproducible discovery rate (IDR) with a hidden Markov model to discriminate between the true and spurious signals obtained in the replicated HTS experiments accurately, and it is able to incorporate an expectation-maximization algorithm and supervised learning for efficient parameter optimization. The results of our analyses of the real-life HTS data showed that reactIDR had the highest accuracy in the classification of ribosomal RNA stem/loop structures when using both individual and integrated HTS datasets, and its results corresponded the best to the three-dimensional structures. Conclusions We have developed a novel software, reactIDR, for the prediction of stem/loop regions from the HTS analysis datasets. For the rRNA structure analyses, reactIDR was shown to have robust accuracy across different datasets by using the reproducibility criterion, suggesting its potential for increasing the value of existing HTS datasets. reactIDR is publicly available at https://github.com/carushi/reactIDR. 
Electronic supplementary material The online version of this article (10.1186/s12859-019-2645-4) contains supplementary material, which is available to authorized users.


Input data of high-throughput structural analysis
To determine RNA secondary structures at the transcriptome level, dozens of HTS methods have been actively developed by combining RNA structure-specific probing with short-read sequencing techniques. In probing-based HTS methods such as SHAPE-Seq [2] and DMS-Seq [3], base modifications induced at flexible and reactive nucleotides increase the drop-off ratio of reverse transcription (RT) at those sites. This phenomenon induces an enrichment of 5′-ends at a single base downstream of accessible bases after RT. Cleavage-based HTS methods such as PARS [4] and ds/ssRNA-Seq [5] also measure the same type of enrichment as the strength of single- or double-strandedness, depending on the nuclease used. For example, nucleases S1 and V1, applied in PARS, cause cleavage at loop and stem structures, respectively. Note that nucleases S1 and V1 have individual characteristics and limitations in recognizing base pairs; for example, V1 cleavage is suggested to preferentially recognize the tips of helices rather than their interiors [6]. Among the many HTS methods applicable to the transcriptome, systematic biases remain a common problem, such as differences in RNA expression or the gradual decrease of sequencing reads from the 3′- to the 5′-end. To mitigate these problems, some previous studies take additional information into account, such as the total read counts aligned to each transcript or the reads elongated across each modification [5]. However, no method so far reduces the influence of such noise sufficiently.
In this study, the typical input data from HTS are assumed to be the set of coverage profiles of sequencing reads whose 5′-ends are located at a single base downstream of each position, as shown in Figure 1 of the main paper. These profiles are represented as "read count", "read depth", or "read coverage" information. Moreover, reactIDR can handle various types of read count information, such as read counts normalized by the reads passing through each position, or reactivity scores computed by any existing method. In this study, however, reactIDR was simply applied to datasets of raw read counts.
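As a minimal sketch (not the reactIDR implementation), such a coverage profile can be built from the aligned 5′-end coordinates by attributing each read to the base one position upstream of its 5′ end; the use of 0-based coordinates here is an assumption:

```python
def rt_stop_profile(five_prime_ends, length):
    """Build a per-position RT-stop count profile for one transcript.

    Each read's 5' end is assumed to lie one base downstream (3' side)
    of the modified/cleaved nucleotide, so its count is attributed to
    position end - 1 (0-based coordinates).
    """
    counts = [0] * length
    for end in five_prime_ends:
        pos = end - 1  # the accessible base is one base upstream
        if 0 <= pos < length:
            counts[pos] += 1
    return counts

# toy example: two reads start at position 5 and one at position 9
profile = rt_stop_profile([5, 5, 9], length=10)
```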
Let there be read coverage data of K samples from HTS. IDR computation is applicable only to K samples in a single condition, and it requires preprocessing for a case/control comparison. In contrast, reactIDR is applicable to replicate samples associated with two different conditions, such as reagent-treated/untreated samples in SHAPE-Seq-like methods and nuclease S1-/V1-treated samples in PARS experiments. For simplicity, we refer to these two conditions as case and control, respectively. It should be emphasized that the case and control samples in this study do not necessarily correspond to true case and control samples from the viewpoint of each experimental methodology. This is because reactIDR aims to extract a significant enrichment of read counts obtained in only one of the conditions compared with the other.
Here, we introduce some notation for the following subsections. For the ith position of transcript t, the total number of mapped reads c_{t,i,k} is obtained from the dataset of the kth experiment among the K experiments. Because IDR estimation is performed on a Gaussian mixture copula model for the joint cumulative distribution, c_{t,i,k} is first converted into a rank order v_{t,i,k} scaled into the range (0, 1) within the kth sample. Tied values are randomly ordered to fit the copula model, in which each marginal distribution is assumed to be a continuous uniform distribution without ties. This randomness is expected to enhance the irreproducibility of low read-depth regions. An input of reactIDR can then be considered a K-dimensional vector v_{t,i} for each position i and transcript t, v_{t,i} = (v_{t,i,1}, ..., v_{t,i,K}). This conversion can be performed either for each transcript separately or for all transcribed nucleotides at once. In the latter case, the read counts of all transcripts are virtually concatenated into a single L-long sequence, and an appropriate transition boundary is set between individual transcripts because of the contextual dependency assumed in reactIDR. The IDR method makes no assumption about contextual dependency between neighboring bases; in reactIDR, by contrast, the dependency between immediate neighbors is modeled by a one-dimensional HMM to take into account the local similarity of secondary structure as well as of read coverage. In this study, we applied the per-transcript conversion for accurate prediction on the well-sequenced rRNA datasets, while the concatenated conversion should remain effective for datasets that include poorly sequenced transcripts. For that reason, the transcript index t is omitted from v_i hereafter. The whole set of v_i over all positions is written V = (v_1, ..., v_L).
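A sketch of this rank conversion, with random tie-breaking so that the scaled ranks fall strictly inside (0, 1); the noise scale is an implementation assumption, chosen small enough never to reorder distinct integer counts:

```python
import numpy as np

def to_scaled_ranks(counts, rng):
    """Convert raw read counts of one replicate into ranks scaled to (0, 1).

    Ties are broken at random so that the marginal distribution behaves as
    a continuous uniform, as the copula model requires. The added noise is
    below 1, so distinct integer counts keep their order.
    """
    counts = np.asarray(counts, dtype=float)
    keys = counts + rng.uniform(0.0, 0.5, size=counts.size)
    ranks = keys.argsort().argsort() + 1      # 1..n, ties randomly ordered
    return ranks / (counts.size + 1)          # strictly inside (0, 1)

rng = np.random.default_rng(0)
v = to_scaled_ranks([10, 0, 0, 7], rng)       # the two zeros tie at random
```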
In addition, the number of samples K is limited to at most 3 in a single-condition analysis and 6 in a case/control analysis, because most HTS experiments are carried out in duplicate or triplicate for each condition.
In addition, let there be read coverage data of K samples for reactIDR, in which the first (latter) K/2 samples are treated with nuclease S1 (V1) in the PARS dataset, and with reagent (DMSO) in the icSHAPE dataset, respectively. In that case, K/2, denoted k_t hereafter, is 2 for experiments with duplicates and 3 for those with triplicates for each of the case and control conditions. v_i is then defined to consist of the read count profiles of the case and control samples, v_i = (v_{i,1}, ..., v_{i,k_t}, v_{i,k_t+1}, ..., v_{i,2k_t}).
To compute a pseudo-value for the Gaussian copula, a library implementation of the error function (erf) is used in reactIDR. The standard normal cumulative distribution can be obtained from erf as Φ(x) = (1 + erf(x/√2)) / 2,
where ϕ(x) and Φ(x) are the standard normal density and the standard normal cumulative distribution function, respectively. We also define a joint cumulative probability function based on a two- or three-dimensional Gaussian copula (corresponding to the cases K = 2 and K = 3), assuming a uniform correlation structure,
where f^(K) is a multivariate normal density, and µ, σ², and ρ are the parameters to be fitted to an observation u. Following [7], we assume the spurious signals in the replicates to be independent, that is, ρ = 0, while ρ is positive for true signals, which are positively correlated, i.e., reproducible, among replicates. The marginal cumulative distribution function for each x_k is obtained for the Gaussian copula as follows.
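In code, the erf-based cumulative distribution and the resulting marginal CDF of the two-component mixture look as follows (a sketch; the parametrization, N(0, 1) for the spurious component and N(µ, σ²) with weight q for the reproducible one, follows the IDR model described above):

```python
import math

def std_normal_cdf(x):
    """Phi(x) obtained from the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def marginal_cdf(x, q, mu, sigma):
    """Marginal CDF of the Gaussian mixture used in IDR: the reproducible
    component is N(mu, sigma^2) with weight q, the spurious one N(0, 1)."""
    return q * std_normal_cdf((x - mu) / sigma) + (1.0 - q) * std_normal_cdf(x)
```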

Gaussian mixture copula in IDR
The irreproducible discovery rate (IDR) is a method to classify each signal as a true (high-coverage and reproducible) or false (low-coverage and irreproducible) signal, based on a Gaussian mixture model that associates the observations with a cumulative joint probability distribution and pseudo-values x. For the mixture copula of two (spurious and true) signal components assumed in IDR, F_k, g, and the vectorized inverse of F_k, F^{-1}, are defined as below, where q corresponds to the mixing proportion of the true-signal component among the candidates. Then, the expected likelihood function Q for the observation v_i is as follows, where P(v_i | rep, θ) and P(v_i | irep, θ) correspond to the probabilities that the ith signal is reproducible and irreproducible, respectively. The third line of Eq. 1 is defined as the local irreproducible discovery rate (idr) in Ref. [7]. The IDR criterion is defined in analogy to the marginal false discovery rate, as follows.
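For K = 2, the local idr of a pseudo-value pair can be written down directly from the two component densities; the following is a sketch under the stated assumptions (spurious component N(0, I) with ρ = 0, reproducible component with common mean, variance, and correlation ρ):

```python
import numpy as np

def bivariate_normal_pdf(x, mu, var, rho):
    """Density of a 2-D normal with equal means and variances and correlation rho."""
    d = np.asarray(x, dtype=float) - mu
    quad = (d[0] ** 2 - 2 * rho * d[0] * d[1] + d[1] ** 2) / (var * (1 - rho ** 2))
    return np.exp(-0.5 * quad) / (2 * np.pi * var * np.sqrt(1 - rho ** 2))

def local_idr(x, q, mu, var, rho):
    """Local irreproducible discovery rate of one pseudo-value pair x."""
    g1 = bivariate_normal_pdf(x, mu, var, rho)   # reproducible component
    g0 = bivariate_normal_pdf(x, 0.0, 1.0, 0.0)  # spurious component: N(0, I)
    return (1 - q) * g0 / (q * g1 + (1 - q) * g0)
```

A concordant high-rank pair receives a low idr, while a pair near the origin, dominated by the spurious component, receives a high one.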
Since IDR does not assume any contextual dependency within each transcript, the product of the likelihood functions over all positions of each transcript is maximized by an EM algorithm with pseudo-data.

Advantage and limitation of IDR
In previous studies, IDR computation was applied to extract reproducible peaks from several sequencing-based analyses such as ChIP-Seq and CAGE-Seq [1,8]. Although the subject of HTS differs from these, a similar tendency, such as irreproducible read counts in low-coverage regions, is observed owing to the systematic biases of sequencing and library preparation.
In addition, IDR has the advantage that it can be estimated on the joint distribution of random variables that need only be marginally uniformly distributed on (0, 1) for the copula. This means that IDR evaluation is available for a variety of score distributions, such as read coverages, ratios of read coverage between different samples, and their p-values, even though the information on the magnitude of the scores is discarded by the rank transformation in IDR.
To compare samples across multiple conditions, such as the case and control samples of HTS, however, another approach is necessary to utilize the IDR concept effectively, because the existing IDR cannot handle comparisons of samples from different conditions. To extend IDR to such comparisons, we developed reactIDR, a novel algorithm designed to extract reproducible true signals from HTS data while imposing the local consistency of IDR through an HMM. In the following subsections, we describe the detailed algorithm of reactIDR and its optimization process.

reactIDR
reactIDR is a novel algorithm designed to extract reproducible true signals from HTS data while imposing the local consistency of IDR through an HMM. As summarized in Figure 1 of the main paper, reactIDR accepts a wide range of datasets as long as they contain replicates, such as HTS datasets obtained under multiple conditions. In reactIDR, each latent variable of the HMM, except for the unmapped class, corresponds to a status of the RNA secondary structure. Specifically, enrichment in the case and in the control corresponds to the loop-like and the stem-like (or background-noise) class, respectively, where the case and control samples are the K/2 samples treated with nuclease S1 (V1) in the PARS dataset and with reagent (DMSO) in the icSHAPE dataset among the K experiments. Each latent class is hereafter referred to as unmapped, stem, or loop, and the set of these three classes is denoted S. Notice that the control samples of probing-based analyses are not necessarily enriched in stem regions, unlike those of cleavage-based methods; however, several studies suggest that RT is likely to stop at stable base pairs [9].
In reactIDR, the loop, stem, and unmapped classes are estimated such that enrichment of v_{i,k} in the case samples, in the control samples, and in neither is observed most frequently in the regions of the respective class. Therefore, according to the likelihood of being loop or stem under the mixture of reliable and spurious signals, reactIDR classifies each position into a class s (∈ S). When all the parameters of reactIDR are denoted θ, the likelihood of a specific path h of latent variables over all nucleotide positions is formulated as below, where h_0 and h_L are always assigned to the unmapped class (P(h_0 | θ) = P(h_L | θ) = 1), because sequencing reads are rarely mapped to the regions around the 5′- and 3′-ends. P(h_{i+1} | h_i, θ) and P(v_i | h_i, θ) are the transition probability from h_i to h_{i+1} and the emission probability of v_i, respectively (detailed later). Then, the log-likelihood ll and the Q function are obtained as below, where H is the set of possible paths of latent variables and θ′ is the set of parameters obtained in the previous iteration. The goal of reactIDR is to estimate the parameters maximizing Eq. 2 and the optimal latent classes based on their posterior probabilities. Figure 5 is a schematic illustration of the relationships among the reactIDR parameter sets. θ consists of a transition matrix and two sets of copula parameters, θ_case and θ_cont, which correspond to the first and latter halves of the samples, respectively. More specifically, θ is defined as follows:

Transition and emission probability
where R is the transition matrix between each pair of latent classes. Among the parameters, R is required for the transition probability, while θ_case and θ_cont are used for the emission probability. The emission probability is defined as follows, where v_case and v_cont are expected to show enrichment in loop-like and stem-like (or background-noise) regions, respectively. The probabilities of rep and irep signals are defined as in IDR, as follows, where h = 1 when j = case and h = 2 when j = cont. In reactIDR, µ_j is restricted such that all elements of µ are the same, as in IDR. Note that each probability is scaled to be 1/q or 1/(1 − q) times the original probability defined in IDR; this is because the proportions of reproducible and irreproducible signals are accounted for in the transition probability in reactIDR. The transition probability, in turn, is simply parametrized by p(h′′ | h′) in R, which satisfies the probability condition shown below. In R, the transition probabilities from 5′ to 3′ and from 3′ to 5′ are allowed to differ, to take into account the asymmetric sequencing efficiency.

EM algorithm
Using the EM algorithm, each parameter is iteratively optimized to maximize the Q function in reactIDR. In the E step, responsibilities are computed for the expectation of the log-likelihood. Then, in the M step, each parameter is moved in the direction of the derivative of the Q function. While the arguments of the maxima are sought in the EM algorithm, optimizing the copula parameters requires recomputing the pseudo-values. Because this recomputation is time-consuming, reactIDR adopts the strategy of limiting the number of parameter-optimization steps within a single iteration, referred to as a generalized EM algorithm [10], using the fmin_l_bfgs_b function of SciPy.

E step
In this step, the expectation of the log-likelihood, Q, is calculated. Of the three terms in the formula, the first term is 0. The second and third terms are obtained as follows, where the responsibility γ is defined for each position and transition state as below, and I is an indicator function. These responsibilities can be obtained efficiently by the forward-backward algorithm, as follows.
Then, a responsibility is obtained as below.
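A generic scaled forward-backward routine of this kind (a sketch, not reactIDR's Cython implementation) computes the per-position responsibilities from the emission and transition probabilities:

```python
import numpy as np

def forward_backward(emission, trans, init):
    """Scaled forward-backward over one transcript.

    emission: (L, S) matrix of P(v_i | h_i = s); trans: (S, S) row-stochastic
    transition matrix; init: (S,) initial distribution. Returns the posterior
    responsibilities gamma with shape (L, S).
    """
    L, S = emission.shape
    alpha = np.zeros((L, S))
    scale = np.zeros(L)
    alpha[0] = init * emission[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for i in range(1, L):                       # forward pass with scaling
        alpha[i] = (alpha[i - 1] @ trans) * emission[i]
        scale[i] = alpha[i].sum()
        alpha[i] /= scale[i]
    beta = np.ones((L, S))
    for i in range(L - 2, -1, -1):              # backward pass, same scaling
        beta[i] = (trans @ (emission[i + 1] * beta[i + 1])) / scale[i + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# flat emissions and transitions give flat posteriors
gamma = forward_backward(np.ones((4, 3)), np.full((3, 3), 1 / 3), np.full(3, 1 / 3))
```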

M step
In this step, a new optimal θ is obtained to maximize the expectation of the log-likelihood, thereby fitting the model to the observations. The new θ should satisfy the following equation. The posteriors for the old parameters θ′, such as γ(h′_i | V, θ′) and γ(h′, h′′ | V, θ′), have already been computed in the previous step.

M step: optimization of the transition matrix for transition probability
To obtain the optimal solution for the transition probability p(h′′ | h′), the Q function is differentiated with respect to p(h′′ | h′), with the constraint that the p(h′′ | h′) sum to one enforced by a Lagrange multiplier λ; the resulting expression contains the expected count of transitions between h′ and h′′. This leads to the following equation for λ. Because λ is the only variable, the relationship among the p(h′′ | h′) for any h′ and h′′ is derived as follows. Thus, the optimal transition probability is simply the ratio of the expected counts of transitions between the states. We then define the transition matrix R that consists of these transition probabilities p.
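The resulting update is a one-liner: each row of R becomes the expected transition counts out of that state, normalized to sum to one. A sketch with a hypothetical matrix xi of summed responsibilities:

```python
import numpy as np

def update_transition(xi):
    """M-step update of the transition matrix: p(h''|h') is the expected
    count of transitions from h' to h'' divided by all expected transitions
    leaving h'. xi[a, b] holds the summed responsibilities gamma(a, b | V)."""
    return xi / xi.sum(axis=1, keepdims=True)

# hypothetical expected transition counts for (unmapped, loop, stem)
xi = np.array([[8.0, 1.0, 1.0],
               [2.0, 6.0, 2.0],
               [1.0, 1.0, 8.0]])
R = update_transition(xi)
```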
In the HMM, the parameter for reproducible peaks, q (π in the original paper), should correspond to the ratio of the latent variables. In this model, therefore, q_loop and q_stem are estimated after updating R. When R satisfies the probability condition in each row, it can be decomposed as R = P D P^{-1}, where D is the diagonal matrix of eigenvalues, the columns p_1, p_2, p_3 of P are eigenvectors, and λ_1, λ_2, λ_3 are the eigenvalues of R. The expectation vector of the duration times d = (d_unmapped, d_loop, d_stem) over V is obtained as follows. When the sequence length L is long enough that the expected duration times can be approximated by the equilibrium distribution, d is given by the eigenvector corresponding to the largest eigenvalue, which equals one. Specifically, the ratios of reproducible samples for stem and loop are obtained as follows, respectively, where q_stem and q_loop are not necessarily equal to 1/3.
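The equilibrium distribution referred to above can be computed from the eigendecomposition; a sketch for a hypothetical row-stochastic R, taking the left eigenvector for the eigenvalue one and normalizing it to sum to one:

```python
import numpy as np

def stationary_distribution(R):
    """Equilibrium distribution of a row-stochastic matrix R: the left
    eigenvector for the eigenvalue one, normalized to sum to one."""
    w, v = np.linalg.eig(R.T)
    idx = int(np.argmin(np.abs(w - 1.0)))
    d = np.real(v[:, idx])
    return d / d.sum()

R = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
d = stationary_distribution(R)   # satisfies d @ R = d
```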

M step: optimization of the copula parameters for emission probability
Since the reproducibility strengths of the case and control samples are assumed to be independent, reactIDR optimizes the mixture-copula parameters µ, σ, and ρ individually for each sample set (θ_case and θ_cont). The derivative of Q with respect to the copula parameters θ_j is as below.
We describe the derivative of Q at a single position i, because Q (and its derivative) is the sum of the per-position contributions.
Define v_{i,case} as u_case, v_{i,cont} as u_cont, and h_i as h′ for simplicity. In logarithmic form, the derivative of P(u_j | h′, θ_j) (defined in Eq. 3) with respect to θ_j can be rewritten as follows, where R, I, and S are defined as below, h = 1 when j = case and h = 2 when j = cont, and Φ(x) is the standard normal cumulative distribution function. Since each F_k^{-1} also depends on θ, the derivative of log P(u_j | h′, θ_j) is obtained as follows. Therefore, the three variables appearing in Eq. 4 are required to compute the derivative of the emission probability with respect to each copula parameter θ. For example, the general case of k_t = 2 is obtained by the Taylor expansion and the chain rule as follows.
where f corresponds to log P(u j |h ′ , θ j ) in reactIDR.

Case of two-dimensional mixture copula
In this subsection, we explain how to obtain each term of Eq. 4 for the two-dimensional mixture copula, in which K = 4 and k_t = 2. For simplicity, we omit the subscripts i, j, and h from the parameters. Thus, u, r, θ, µ, σ², ρ, and q correspond to u_case, r_case, θ_case, µ_1, σ²_1, ρ_1, and q_loop for the case samples, and to u_cont, r_cont, θ_cont, µ_2, σ²_2, ρ_2, and q_loop for the control samples. This is because the derivation of Eq. 4 proceeds in the same way for j = case and j = cont. In addition, all elements of µ are set to µ, as defined in IDR. We compute the derivative of Q via Eq. 4 in the order shown below.

Compute
First, step 1 is carried out as follows.
This can be calculated as follows. The numerator of the above equation can be formulated for each parameter, µ, σ², and ρ, in the following ways (step 1.1).
Next, the derivative of the emission probability with respect to x_1 is obtained as follows (step 2). When there are two replicates for each of the two conditions, the variables required for the derivatives of the emission probabilities are formulated as below. The derivative of the emission probability with respect to x_2 can be obtained by exchanging x_1 with x_2 and S_1 with S_2. Lastly, the derivatives with respect to each copula parameter are obtained as shown below (step 3).
Let s be defined as below.
Then, the derivative with respect to µ is formulated as follows.
The derivative with respect to σ² is formulated as follows.
The derivative with respect to ρ is formulated as follows.

Case of three-dimensional mixture copula
In this subsection, we explain how to obtain each term of Eq. 4 for the three-dimensional mixture copula, in which K = 6 and k_t = 3. When there are three replicates, the variables required for the derivatives of the emission probabilities are formulated as below. In a similar way to the two-dimensional case, the derivatives of R, I, and S with respect to x_1 are obtained as follows.
The derivative of the emission probability with respect to x_k can be obtained by exchanging x_1 with x_k and S_1 with S_k.
The derivatives of R, I, and S with respect to each copula parameter are obtained as shown below. The derivative with respect to µ is formulated as follows.
The derivative with respect to σ² is formulated as follows.
The derivative with respect to ρ is formulated as follows.

Implementation and setting for supervised learning
reactIDR is implemented in Python 3.x with Cython and is publicly available at https://github.com/carushi/reactIDR under the GPL v2.0 license; it is based on the previous IDR Python implementation available at https://github.com/nboley/idr. reactIDR accepts duplicates or triplicates of HTS data as input. To reduce computation time and avoid overflow on genome-wide datasets, reactIDR provides two options: an independent HMM mode and a multiprocess mode. With the independent HMM, the forward-backward algorithm is applied to each transcript separately, while the parameters are optimized over all transcripts at once. The multiprocess mode parallelizes time-consuming steps such as the pseudo-value computation. In addition, the simplest genome-wide analysis, computing IDR independently for each transcript, is enabled with the "each" option. We also provide a Docker image at https://hub.docker.com/r/carushi/rt_end_counter/ to produce reactIDR input in the appropriate format from a BAM file of HTS data. In the Docker image, the number of sequencing reads that start at the subsequent (3′) base is counted and written to a BED file, taking into account the base quality, soft-clipped bases, and the maximum length of the sequencing reads. A detailed manual is available at https://github.com/carushi/reactIDR.
In this study, we applied reactIDR to the problem of structure prediction for rRNAs. We first optimized the parameters of reactIDR by fixing the latent classes to the reference structure of 18S rRNA, and then predicted the structure score for the target rRNAs. Training and testing were carried out with negative data from the minus strand of each rRNA.

Suitability of reactIDR to high-throughput structural analyses
To evaluate the reproducibility of high-throughput sequencing data, quality assessment is widely performed by evaluating how reproducible the top-ranked signals are according to a correlation coefficient such as Pearson's or Spearman's. However, such an index is occasionally unsuitable for measuring reproducibility, particularly for data from heterogeneous experiments, for distributions of arbitrary scores (e.g., reactivity scores), or for data with missing values [1,11]. For example, a severe defect of the Pearson correlation coefficient is its susceptibility to outliers, which occur in different ranges depending on the choice of scoring scheme. The Spearman correlation, in contrast, is a nonparametric measure based on the rank values of two variables. Owing to the rank transformation, outliers are less influential on the Spearman correlation coefficient, although the information on the magnitude of the scores is discarded. For high-throughput sequencing data, however, the Spearman correlation coefficient still depends strongly on the threshold used to filter out lower-ranked signals for a meaningful evaluation of reproducibility, and setting a single threshold across multiple experiments is difficult. In the IDR paper, the authors describe reproducibility as the extent to which the ChIP-Seq peak ranks remain consistent across replicates as significance decreases. By automatic classification into true and spurious signals based on the Gaussian mixture copula, IDR can quantify reproducibility from the true(-ish) signals alone, depending on the consistency of the samples in each rank range, without an arbitrary threshold for each distribution. Moreover, the sequencing reads obtained from HTS analyses show a peak-like distribution similar to that of ChIP-Seq, to which IDR was originally applied.
Taken together, the existing IDR method and reactIDR can be considered theoretically applicable to the HTS analyses.
In this study, we applied reactIDR to both PARS-like (PARS) and SHAPE-like (icSHAPE) methods. For efficient optimization in supervised learning, the enrichment of truncated reads in the case and control samples was associated with the loop and stem classes, respectively. The control samples of icSHAPE correspond to the condition without chemical treatment, while those of PARS are S1-treated; thus, the control samples are not literally controls. The latent class stem is therefore supposed to represent positions of background (bg) noise in the icSHAPE control samples, as described in the main paper, but reactIDR assigns stem to base-paired positions during supervised learning for efficient parameter optimization. This is because reactIDR treats the control samples as the competitor of the case samples: on this basis, reactIDR utilizes the stem class to remove false positives in SHAPE-like methods, while stem structures are directly detected by PARS-like methods. Additionally, no truly unmapped loci are available for supervised learning. Thus, we used the complementary transcript of each rRNA and the read count distribution aligned to it as the next-best choice. If prior knowledge of unmapped regions, such as spike-ins or negative controls, is available, reactIDR can optimize its parameters based on that information.
Data processing

PARS score computation
PARS data consist of two kinds of sequenced samples: one treated with nuclease S1 to measure accessibility, and the other treated with RNase V1 to measure the structural stability at each position [12]. To calculate the PARS score, the raw read depth is first normalized by the total number of mapped reads for each condition (represented as c_PARS_condition(i, t)) [12]. In the first PARS study, the PARS score was calculated as PARS(i, t) = log2 c_PARS_V1(i, t) − log2 c_PARS_S1(i, t), where c_PARS_V1(i, t) and c_PARS_S1(i, t) correspond to the normalized read coverage of the V1 (S1) dataset at the ith nucleotide (detailed in Ref. [12]). When replicates are available, c_PARS_V1 and c_PARS_S1 are obtained as the average or sum of the read counts. The PARS score is then capped at ±7 for scaling. In Ref. [13], however, the PARS score was redefined as the five-base average of the previous PARS scores, as below. In this study, the latter PARS score r_PARS was computed using the means of c_PARS_V1 and c_PARS_S1 to obtain a single score from the replicates, because such averaging can buffer the local ambiguity of read mapping in exchange for resolution. Indeed, the consistency of PARS with computational prediction was already demonstrated in the ParasoR study [14].
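The capped log-ratio and its five-base running mean can be sketched as below; the pseudocount is our assumption to avoid taking the logarithm of zero and is not part of the published definition:

```python
import numpy as np

def pars_score(v1, s1, pseudocount=1.0, cap=7.0):
    """PARS-style score: log2 ratio of normalized V1 to S1 coverage,
    capped at +/-cap. The pseudocount is an assumption of this sketch."""
    v1 = np.asarray(v1, dtype=float) + pseudocount
    s1 = np.asarray(s1, dtype=float) + pseudocount
    return np.clip(np.log2(v1 / s1), -cap, cap)

def smoothed_pars(v1, s1, win=5):
    """Five-base running mean of the raw PARS scores, as in the later
    PARS study's redefinition."""
    kernel = np.ones(win) / win
    return np.convolve(pars_score(v1, s1), kernel, mode="same")

raw = pars_score([7, 0], [0, 7])   # -> [3.0, -3.0]
```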

icSHAPE score computation
Another method, icSHAPE [15,16], detects flexible nucleotides using the NAI-N3 reagent by observing the enrichment of RT drop-off relative to a DMSO-treated control sample. In [16], the reactivity score r_icSHAPE_condition(i, t) is defined as below,
where c_icSHAPE_condition is a normalized read count, c_icSHAPE_background,DMSO(i, t) is the number of reads that read through the ith nucleotide, and α is a parameter that adjusts the strength of the background control, set to 0.25 in this study as optimized in the previous study [15]. c_icSHAPE_condition and c_icSHAPE_DMSO are the normalized read coverages, i.e., the ratios of RT-stop read counts to the whole sequencing library. They are scaled so that the mean of the 90-95% most reactive bases across the library is 1. Then, the 5-95% range of r_icSHAPE_condition is scaled to [0, 1] for each transcript independently by 90% Winsorization (the bottom and top 5% of bases are set to 0 and 1, respectively). This 90% Winsorization is performed for all positions except the 32 bases at the 5′ and 3′ ends of each transcript, owing to the difficulty of mapping them appropriately. When c_icSHAPE_background,DMSO(i, t), which does not include cDNA reads that stop at the ith nucleotide, is zero, icSHAPE score(i, t) is excluded from the dataset. In addition, owing to read-mapping bias, the icSHAPE scores at the very 5′- and 3′-ends are also excluded.
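A rough sketch of this scoring, assuming the background-corrected form r = (c_condition − α·c_DMSO)/c_background and omitting the per-library scaling, the zero-background exclusion, and the 32-base end trimming:

```python
import numpy as np

def winsorized_reactivity(c_nai, c_dmso, c_background, alpha=0.25):
    """icSHAPE-style reactivity sketch: background-corrected RT-stop
    enrichment followed by 90% Winsorization onto [0, 1]. The 5th/95th
    percentiles are mapped to 0/1 and values outside are clipped."""
    r = (np.asarray(c_nai, float) - alpha * np.asarray(c_dmso, float)) \
        / np.asarray(c_background, float)
    lo, hi = np.percentile(r, [5, 95])
    return np.clip((r - lo) / (hi - lo), 0.0, 1.0)

# toy profile: linearly increasing treated counts over a flat background
r = winsorized_reactivity(np.arange(1.0, 101.0), np.ones(100), np.ones(100))
```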

Log dor ratio computation for BUMHMM
BUMHMM was developed for robust reactivity computation using replicate information [17]. BUMHMM employs the log-ratio of the drop-off rates at the same nucleotide in a pair of control replicates (log drop-off rate ratio, LDR) as an input. Note that this ratio is computed at each position by dividing the raw read count by the read coverage at the same position, whereas the background density in the icSHAPE score computation is the read coverage one base downstream. Additionally, sequence information was supplied, since BUMHMM constructs its null distribution depending on the sequence context.

Multiple high-throughput structural methodology comparison and computational prediction
To investigate the consistency among multiple HTS methods, we used the Structure Surfer database [5], in which the structure scores obtained by icSHAPE [15], DMS-Seq [3], PARS [13], and ds/ssRNA-Seq [5] are mapped to the human and mouse genomes and combined with UCSC gene IDs. From the entire dataset, we selected the studies analyzing the human transcriptome, which consisted of two replicates of PARS (rep1 and rep2), DMS-Seq, and the ds/ssRNA-Seq dataset. To compare experimental and computational structure analyses, we used ParasoR [14], which is applicable to long RNAs, to predict stem probabilities. Using ParasoR, p_stem was computed for each transcript of the hg19 knownGeneMrna dataset downloaded from the UCSC database [18], with a maximal span of W = 200. We then analyzed the pairwise correlations of the structure scores of the experimental and computational analyses, only for transcripts with the same length in both datasets. Transcripts shorter than 100 nucleotides, or with structure scores annotated over less than half of their regions, were excluded from the comparison. The transcripts remaining after removing the no-score regions are shown in Table 1. Because DMS-Seq and ds/ssRNA-Seq score each nucleotide for single-strandedness while the other scores correspond to double-strandedness, the negated DMS-Seq and ds/ssRNA-Seq scores were used in the calculation of the correlation coefficients. To scale each score into the range (0, 1), we computed the normalization factors smin and smax, the minimum and maximum structure scores, and normalized the structure scores as follows.
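This scaling step amounts to min-max normalization; a minimal sketch (the sign flip for DMS-Seq and ds/ssRNA-Seq scores would be applied before this step):

```python
import numpy as np

def scale_scores(scores):
    """Min-max normalization of structure scores into [0, 1], using the
    normalization factors smin and smax of each score distribution."""
    s = np.asarray(scores, dtype=float)
    smin, smax = s.min(), s.max()
    return (s - smin) / (smax - smin)

scaled = scale_scores([-2.0, 0.0, 2.0])   # -> [0.0, 0.5, 1.0]
```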

Mapping and quantification of PARS and icSHAPE datasets
For whole-transcriptome analyses, we downloaded the PARS score dataset for the human transcriptome, as measured in a previous study (GEO accession number GSE50676) [13]. We obtained the normalized read counts of GM12878 native deproteinized samples treated with nuclease S1 and V1 (hereafter referred to as S1 and V1, respectively) for two replicates. To remove the influence of indiscriminately multi-mapped reads, we realigned the sequencing reads from the downloaded fasta files against the set of rRNA repeat unit sequences and the human transcriptome. The human transcriptome reference consisted of UCSC RefSeq sequences (as of October 7, 2016) and GENCODE transcript sequences (v12), with the reference database constructed as described by Wan et al. [13]. To construct the reference set of rRNA sequences, a human ribosomal repeating unit (NT_167214.1) was extracted from the NCBI database, and a 5S rRNA sequence (ENST00000364451) was additionally downloaded from the Ensembl database. Read counting from the sequencing reads already aligned to the reference sequence was performed using the Docker image published with reactIDR. Read counts were then normalized by the total counts mapped to the rRNA repeat unit and the transcriptome. As a representative of RT-stop-based reactivity detection, the icSHAPE dataset was compared with the reference structure and the PARS dataset. In this study, the dataset of a previous icSHAPE study of HEK293T cells was analyzed for comparison with the in vitro PARS dataset (GEO accession number GSE74353) [19]. We downloaded fastq files generated from three conditions with two replicates each: DMSO-, in vitro NAI-N3-, and in vivo NAI-N3-treated cells. Then, specific barcodes 13 nt in length were removed from the 5′-ends of RNA fragments after excluding PCR duplicates from the sequence library.
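The duplicate removal and barcode trimming described above can be sketched as below. This is a simplified illustration on plain sequences, not the actual pipeline (which operates on fastq files); the 13-nt barcode length follows the text.

```python
def dedup_and_trim(reads, barcode_len=13):
    """Remove exact PCR duplicates, then trim the first `barcode_len`
    nucleotides (the icSHAPE barcode) from each remaining read.
    Simplified sketch of the preprocessing described in the text."""
    seen = set()
    trimmed = []
    for seq in reads:
        if seq in seen:
            continue  # completely identical reads are treated as PCR duplicates
        seen.add(seq)
        if len(seq) > barcode_len:
            trimmed.append(seq[barcode_len:])
    return trimmed
```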
RT stop counts were measured from reads mapped with bowtie2 [20] to the human transcriptome reference constructed for the PARS analyses. A previous study suggested that a large fraction (more than 20%) of cDNAs tend to have mismatches at the extreme 3′ end due to the terminal transferase activity of the reverse transcriptase [21]. To confirm the abundance of flanking nucleotides observed in icSHAPE and PARS reads, local alignment with bowtie2 was also performed. The number of flanking nucleotides of each length was determined from the number of soft-clipped bases at each 5′ end in the bam files.
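Counting soft-clipped bases at the read 5′ end can be illustrated directly from SAM CIGAR and FLAG fields, as below. This is a sketch, not the pipeline's code (which worked on bam files); note that for reverse-strand alignments the read's 5′ end corresponds to the last CIGAR operation.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def five_prime_softclip(cigar, flag):
    """Return the number of soft-clipped ('S') bases at the 5' end of a
    read, given its SAM CIGAR string and FLAG field. For reverse-strand
    reads (FLAG bit 0x10), the 5' end is the right end of the alignment."""
    ops = CIGAR_OP.findall(cigar)
    if not ops:
        return 0
    length, op = ops[-1] if flag & 0x10 else ops[0]
    return int(length) if op == "S" else 0
```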

Command lines for read mapping
For the PARS dataset, only the first R1 reads, truncated to the first 50 bases, were mapped with an allowance of one mismatch using bowtie2. The box below shows an example of the command lines for PARS data processing. For the icSHAPE dataset, sequencing reads that were completely identical were removed as PCR duplicates before barcode truncation, because the number of barcodes incorporated in icSHAPE is large enough that the chance of two random fragments being concatenated with the same barcode is very low (1/4^9) [16]. The sequenced reads, with their first 13 nucleotides truncated, were mapped to the reference of the transcriptome and rRNA repeat unit using local alignment with bowtie2. The number of reads assigned to each position was counted, allowing a single flanking nucleotide at the 5′-end. The box below shows an example of the command lines used for icSHAPE data processing.

3. Trimmomatic v0.36 [22]
4. samtools v1.3.1 [23]
5. bedtools v2.18 [24]

Accuracy evaluation of reactIDR structure prediction
As the rRNA structure reference, a cryo-EM-based ribosomal structure (PDB ID: 4v6x) was aligned to our reference sequences [25]. To obtain base-to-base correspondence between the reference sequences and the structure, the 5S, 5.8S, 18S, and 28S rRNA reference sequences were aligned to each sequence within the structure dataset, and all bases successfully aligned to the reference were used for the evaluation of classification. Of the four rRNAs in the cryo-EM ribosomal structure, the 18S rRNA reference structure was used as the training set for the supervised learning of reactIDR, with the minus strand of 18S rRNA as the negative data. The annotations of secondary structure and base pairings were obtained for this structure using RNAview [26] with the RNApdbee interface [27]. The accessible surface area (ASA) of the ribosome structure was calculated for each nucleotide using NACCESS with default parameter settings [28].
To evaluate the accuracy of structure classification based on reactivity, we used the receiver operating characteristic (ROC) curve. The numbers of true positives, true negatives, false positives, and false negatives are referred to as TP, TN, FP, and FN. In the ROC curve, the y-axis corresponds to the true positive rate, TP/(TP + FN), and the x-axis corresponds to the false positive rate, FP/(FP + TN). P-values comparing the AUC of reactIDR with those of the other scoring methods were computed using the pROC library in R with a bootstrap of 100 repetitions [29]. To measure the accuracy of in silico structure prediction with or without structure scores, the positive predictive value (PPV), TP/(TP + FP), the sensitivity (identical to the true positive rate), and the Matthews correlation coefficient (MCC),

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

were computed against all possible base pairs within the transcript. Pseudo-knotted and inter-transcript base pairs were converted to loop structures in the reference. The accuracy of structure prediction based on multiple HTS datasets was also investigated using a support vector machine (SVM) and linear discriminant analysis (LDA), as implemented in the scikit-learn library (which depends on LIBSVM [30] and LIBLINEAR [31]).
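The evaluation metrics above can be computed from the confusion counts as in the following sketch (the MCC follows the standard definition, with a zero fallback for a degenerate denominator, which is an assumption of this sketch):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics used in the text from confusion counts:
    true positive rate (sensitivity), false positive rate, positive
    predictive value (PPV), and Matthews correlation coefficient (MCC)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    ppv = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv, "MCC": mcc}
```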

Read count distribution
To assess the applicability of existing case/control comparison methods to HTS data, we examined the distributions of the means and variances of the read counts at each base in the PARS data [13]. The variances were generally larger than the means, indicating that the read coverage distribution does not follow a Poisson distribution (Figure 6). Moreover, HTS data contain many "unmapped" regions, so that quite a few positions lie on the line y = 2x, corresponding to cases where read counts were observed in only one of the replicates. For example, the coordinate point (x, y) = (1, 2) corresponds to a case whose read counts were 0 in one sample and 8 in the other. This indicates that HTS data require careful treatment of such severe sparsity. Although zero-inflated negative binomial modeling has previously been studied for this purpose, IDR is expected to be suitable for reactivity estimation because it evaluates whether a read count is enriched in a reproducible way. Such a lack of enrichment is also expected to serve as a clue to the conformational status in HTS data.
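The overdispersion check described above amounts to comparing the per-position mean and variance of counts across replicates, as in this sketch (unbiased, n−1 variance is assumed here):

```python
def mean_variance_per_position(replicates):
    """For each position, compute the mean and unbiased variance of read
    counts across replicates. Under a Poisson model the variance tracks
    the mean; HTS coverage typically shows variance > mean (overdispersion),
    as illustrated in Figure 6."""
    n = len(replicates)
    out = []
    for counts in zip(*replicates):
        mean = sum(counts) / n
        var = sum((c - mean) ** 2 for c in counts) / (n - 1)
        out.append((mean, var))
    return out
```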

Multi-comparison between high-throughput structural analyses and in silico structure analyses
In this section, we present a comparison of the structure scores of the human transcriptome produced by different protocols, using the Structure Surfer database [5]. Structure Surfer is a database of transcriptome-wide HTS analyses for multiple species, and we investigated the basic properties of the stored HTS datasets. Figure 1 shows the distribution of structure scores for six HTS datasets. The distributions of structure scores clearly differed between the cleavage-based (PARS and ds/ssRNA-Seq) and modification-based approaches (DMS-Seq and icSHAPE). In particular, the score distributions of the modification-based approaches show a long tail in one direction. In contrast, those of PARS and ds/ssRNA-Seq have their medians in the middle of the distributions, because such methods can detect discriminative stem and loop regions through the log ratio of the enrichment between single- and double-stranded regions. These differences should be aligned carefully for cross-comparison between different methodologies; otherwise, they might cause spurious inconsistency.
Next, we evaluated the consistency of genome-wide structure analyses on the human transcriptome data. We calculated Spearman's correlation coefficients for pairs of reactivity vectors assigned to the same locations across each human transcript, excluding regions without scores. In addition to the experimentally measured reactivities, we compared the stem probability estimated in silico by the ParasoR software to assess the consistency between the thermodynamic folding model and the experimental structure analyses. Most correlation coefficients were distributed around 0, indicating random correlation between different studies (Figure 2). The PARS replicates, however, showed clearly higher correlations with each other than the other pairs, and a moderate positive correlation was observed between PARS and ds/ssRNA-Seq. This tendency was robust to the way target regions were selected, such as selecting transcripts by their lengths or by a specific type of nucleotide (data not shown). In addition, the number of detected transcripts varied considerably among the HTS datasets, resulting in a large non-overlap for comparison (detailed in the section "Data Processing").
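The pairwise comparison above can be sketched with a stdlib-only Spearman correlation (tie-averaged ranks followed by a Pearson correlation on the ranks); in practice a library routine such as scipy.stats.spearmanr would be used:

```python
def spearman(x, y):
    """Spearman's rank correlation between two equal-length score vectors,
    computed as the Pearson correlation of tie-averaged ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for tied values
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```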
It should be noted that each HTS dataset was obtained from different cell types under different conditions and must therefore show cell-to-cell variation. Hence, this result at least indicates that a direct comparison of structure scores across multiple HTS datasets exhibits only a weak correlation, even though these studies share the common purpose of revealing the structural landscape of the human transcriptome. This is a matter of concern, because it might be meaningless to analyze the whole landscape of RNA secondary structure over the transcriptome if most transcripts are truly inconsistent. However, it is not clear whether such incongruence arises from the heterogeneity of the human transcriptome as related to the cellular condition or from the systematic biases of each method. In fact, RNA secondary structure is known to vary depending on environmental factors, such as temperature, pH, or the interference of binding factors. Nevertheless, since the HTS replicates showed substantially high correlations in this study, we consider that most transcripts form a consensus secondary structure that is stable enough to be detected by probing or cleavage.
Moreover, we examined mismatch occurrence at the 5′-ends of sequencing reads. To evaluate the fraction of flanking bases at the 5′-end of the RNA fragment (or the 3′-end of the reverse-transcribed DNA), we aligned reads from the PARS and icSHAPE datasets to an rRNA reference. Using bowtie2 in local alignment mode, the number of soft-clipped bases was counted as the number of flanking bases. As a result, a greater fraction of reads contained soft-clipped bases in the icSHAPE datasets than in the PARS dataset, particularly in the DMSO-treated icSHAPE dataset (Figure 7). This result is consistent with the previous report on flanking nucleotides produced by terminal transferase activity [21]. Because such flanking bases might also reduce the resolution and precision of structure detection, we selected only reads having at most one soft-clipped base at the 5′-end from the icSHAPE datasets in the following analyses.

Consistency of computational and experimental structure analyses based on IDR classification
Next, the consistency between the PARS dataset and computational structure prediction was examined based on the IDR classification. We tested an IDR-based classification in which a nucleotide is classified as "S1 (accessible)" or "V1 (stem)" when the estimated IDR is lower than 0.01 for the S1 or V1 dataset, respectively. We also defined a "Neutral" class for nucleotides belonging to neither the accessible nor the stem class, or to both. Figure 8 shows the distribution of stem probabilities for each class based on the IDR computed for the S1 and V1 datasets. The median stem probability clearly increased in the order of accessible, neutral, and stem. Although a certain number of nucleotides remained inconsistent between the stem probability and the PARS score, they might be influenced by complex secondary structures, such as long-range base pairs or RNA-binding protein interactions, as well as by failures of the computational prediction.

Table 3. The areas under the ROC curves (AUROCs) of rRNA structure prediction for the test set by structure scores based on eight kinds of reactivity scoring methods: Count, icSHAPE or PARS score (Score), BUMHMM [17], PROBer [32], IDR, and reactIDR with the parameters initialized by supervised learning and fitted to each transcript (reactIDR). In addition, the prediction accuracies are shown for several variants of reactIDR: reactIDR with the parameters initialized by supervised learning and without any further fitting (Supervised), reactIDR with the parameters fitted to each transcript (Unsupervised), and reactIDR with the parameters initialized by supervised learning on the PARS dataset and then fitted to each transcript. AUROCs for the structure classification of the test set (28S, 5S, and 5.8S rRNAs, shown in the top panel) and the training set (18S rRNA, shown in the bottom panel) were computed with the enriched structure status (i.e., loop or stem) set as positive.
Each column corresponds to the AUROCs for the prediction of accessible loop sites for the in vivo and in vitro icSHAPE datasets and of stem and loop regions for the PARS V1 and S1 datasets, respectively. In addition, we applied the prediction of structure scores without a control for the PARS S1 and V1 datasets, shown as "S1 only" and "V1 only". The AUROCs of BUMHMM and PROBer shown in gray were obtained by methods not originally designed for PARS experiments. Those of reactIDR and Supervised for 18S rRNA structure prediction are also shown in gray because they used the information of the 18S rRNA structure (the training set) through the parameters optimized by supervised learning; in other words, their test set and training set are the same. † indicates that the structure score used for evaluation is common between the case (loop-enriched) and control (stem-enriched) conditions. Even such common structure scores, however, can produce different AUROCs between stem and loop prediction from the PARS S1 and V1 datasets, because the nucleotides were not classified symmetrically: undeterminable nucleotides (e.g., missing data) were assumed to be the least likely candidates.

(Figure title: Sequencing read manipulation in this study.)

Figure 3. (Left) reactIDR receives input data consisting of 5′-end read depths measured repeatedly in the case and control conditions. The input data are first converted into scaled ranking data to fit the joint cumulative probability distribution of the copula model applied in the irreproducible discovery rate (IDR), which discriminates true signals from irreproducible false signals. (Right) In reactIDR, the IDR is combined with a hidden Markov model of three latent classes (stem, loop, and unmapped) to handle case/control comparison, in which each parameter is optimized to the data and the reference structure by an expectation-maximization algorithm. reactIDR then provides the posterior probabilities of being loop and stem, which can be used as structure scores for in silico structure prediction.

Figure 5. In reactIDR, the parameters to be optimized can be roughly classified into three sets: a transition matrix R and two copula parameter sets θ_case and θ_cont. In each M step of the EM algorithm, R is first optimized using the expected transition frequencies, and then the ratios of reproducible signals q_loop and q_stem are computed. θ_case and θ_cont are optimized individually because they appear in separate terms of the Q function.
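The conversion of raw read depths into scaled ranking data mentioned in the Figure 3 caption can be illustrated as follows. This is only a sketch: the (n + 1) denominator, chosen so that scaled ranks stay strictly inside (0, 1), and the lack of tie handling are assumptions, and the exact scaling used by reactIDR may differ.

```python
def scaled_ranks(counts):
    """Convert raw 5'-end read counts into scaled rank data in (0, 1),
    the form consumed by the copula model in the IDR computation.
    Ranks are scaled by (n + 1) so values stay strictly inside (0, 1);
    ties are broken arbitrarily (illustrative assumption)."""
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / (n + 1)
    return ranks
```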

Tables
Figure 6. Relationship between the log10 mean (x-axis) and log10 variance (y-axis) of the read count distribution. (A) and (B) correspond to (read count + 1) from the PARS S1 and V1 datasets, respectively, for each nucleotide among the replicates. Red lines are the y = x lines, which correspond to the equal mean and variance of a Poisson distribution.

Distribution of stem probability based on the IDR-based classification into the three classes of accessible, neutral, and structured. The three groups shown on the x-axis correspond to nucleotides enriched in the case, enriched in the control, or otherwise, classified by the IDR criterion [1]. Each nucleotide is classified as accessible or structured when the estimated IDR is lower than 0.01 for the S1 or V1 dataset, respectively. After this classification, the samples included in neither the accessible nor the structured class, or belonging to both, were classified into the neutral class.

Figure 10. Precision-recall curves of rRNA structure prediction for the test set by structure scores based on the eight kinds of reactivity scoring methods, as in the analysis performed in Figure 9.

AUROCs of E. coli rRNA structure prediction for each reactivity score based on the E. coli SHAPE mutational profiling (MaP) data obtained in Ref. [33] and the structure references in the CRW database [34]. Three types of SHAPE-MaP data were published, in which the reagents 1-methyl-6-nitroisatoic anhydride (1m6), 1-methyl-7-nitroisatoic anhydride (1m7), and N-methylisatoic anhydride (NMIA) were individually applied to rRNAs. Using these data, the prediction of loop regions was performed for 5S and 23S rRNA using the parameters optimized for 16S rRNA in a supervised manner, and for all rRNAs in an unsupervised manner. The performance was evaluated for six types of reactivity scoring methods by calculating AUROCs.
To compute the structure scores, reactIDR and IDR took each pair of SHAPE-MaP datasets as duplicates (shown in the left panels) or all of the data as triplicates (shown in the rightmost panels). "1m6", "1m7", and "nmia" correspond to the prediction accuracy using the reactivity scores of the samples treated with 1m6, 1m7, and NMIA, respectively, while "count" corresponds to that using the average of the reactivity scores.