reactIDR overview
A schematic workflow of reactIDR is presented in Fig. 1c and Additional file 1: Figure S3. We aimed to develop reactIDR so that it can be applied for solving HTS problems as follows:
-
Input – Assuming that K is the number of samples in an HTS experiment, and the sequencing reads were obtained from two K/2 samples, under two different conditions (and an optional reference structure), each read can be mapped against a reference sequence, and the indicator of reactivity at each base can be measured by counting the reads that start at the subsequent (3′) base.
-
Output – The output represents the probability that each nucleotide is the latent class loop.
-
Objective function – To provide reliable structure scores for accurate structure prediction, the outputs should be optimized so that they satisfy the following conditions: (1) the sequential loop regions of the RNAs provide consistently enriched read counts under the case conditions, in contrast to those under the control conditions, while stem motifs should be enriched in the control obtained from the PARS dataset, and (2) the true signals of structural footprints should be consistent among the replicates with the higher expression.
Using the HTS approach, the positions in stem (base paired) or loop (accessible) structures can be detected by the enrichment of chemical or enzymatic footprints, since their reactivity is highly affected by the existence of base pairs. However, to remove many false positives derived from the random occurrences of fragmentation or endogenous modification, reactIDR further considers the ratio of read count enrichment between two-conditional samples. As shown in Fig. 1, case (control) samples were specifically defined as the first (last) K/2 samples with the sequencing reads stochastically truncated at one base downstream from the loop (stem or non-specific background, represented as bg). In reactIDR, a latent class loop, stem/bg, or unmapped, is expected to be assigned to the region corresponding to the case-derived, control-derived, and no sufficient reads obtained, respectively. Here, icSHAPE case and control samples were regent-treated and untreated, respectively, while PARS samples were treated with S1 and V1, respectively, and the control samples were not the untreated samples.
IDR
Our novel algorithm, reactIDR, extends the idea of the IDR [13] criteria to obtain reproducible and irreproducible classification. We first introduce the model for the IDR estimation. Here, as an input of IDR and reactIDR, we considered the read-count data obtained for a single transcript or concatenated multiple transcripts with the total length of L nucleotides. As these data were initially converted into the ranking data across all positions in each experiment, these ranking data in an ascending order were scaled to the range of [0,1]. These normalized data were hypothesized to correspond to the cumulative probability distribution in the copula model, necessarily within that range. After scaling, a vector of the ranks of read counts at the ith position was defined as vi:={vi,1, …, vi,K} for IDR, and vi:={vi,case,vi,cont}={{vi,1, …,vi,K/2},{vi,K/2+1… vi,K}} for reactIDR. The overall input data of the IDR and reactIDR were: V={v1, …, vL}. IDR is based on the use of the mixture copula model to discriminate between one reproducible (represented by rep) distribution and the irreproducible (irep) distributions, recognizing them as true and false signals, respectively. Each copula can explain the distribution of true, reproducible or false, poorly correlated, signals. This is because true signals obtained from the protein binding sites in ChIP-Seq should exhibit reproducible enrichment with an increase in the read-count depth and a high correlation between the replicates. Based on this, the signals categorized into rep are thought to contain many more true signals compared with those in the irep group.
For simplicity, we considered an example of the mixture of two Gaussian copulas in two dimensions hereafter, showing that our approach can be used for the analysis of two experimental replicates. When the ith signal is produced by true (false) signals, represented as ri=true(false), the distribution of a random variable xi=(xi,1,xi,2) corresponding to the ith signal is as follows:
$$ \begin{aligned} \left(\begin{array}{c} x_{i,1} \\ x_{i,2} \end{array}\right) {\Bigg\rvert}{h} \sim \mathcal{N} \left(\left(\begin{array}{c} \mu_{h} \\ \mu_{h} \end{array}\right),\left(\begin{array}{cc} \sigma_{h}^{2} & \rho_{h} \sigma_{h}^{2} \\ \rho_{h} \sigma_{h}^{2} & \sigma_{h}^{2}, \end{array}\right)\right) \end{aligned} $$
(1)
where μh, σh, and ρh represent mean, variance, and correlation, respectively, with the indicator h for each state of ri (h=1 when ri=true and h=0 when ri=false). The distribution of true rep signals is assumed to contain a positive mean and correlation (μ1>0 and 0<ρ1≤1), while spurious irep signals should be located around the low-coverage regions and show increased variance (\(\mu _{0}=0, \ \sigma _{0}^{2} = 1\), and ρ0=0). These parameters can be fitted not to the actual read counts, but to the normalized ranks of read coverage for ChIP-Seq data v, such that those ranks are derived from the cumulative marginal probability distribution function of xi,k (k=1,2), as shown below:
$${\begin{aligned} F_{k}(x_{k}) := \ v_{i, k} = {\int}_{-\infty}^{x_k} q\mathcal{N}\left(x^{\prime}_{k}|\mu_{1}, \sigma_{1}^{2}\right)+(1-q)\mathcal{N}\left(x^{\prime}_{k}|\mu_{0},\sigma_{0}^{2}\right) dx^{\prime}_{k}, \end{aligned}} $$
where q represents the ratio of true and reproducible signals in the samples and k represents the k-th experiment. By considering the contribution of the second Gaussian distribution representing irreproducible signals, a probability of each peak derived from the irreproducible signal or IDR, can be estimated.
reactIDR: an algorithm based on the IDR and HMM
While IDR is a powerful method for the evaluation of the reliability of various joint distributions of ChIP-Seq peaks, HTS data used for RNA stem/loop detection must additionally include the consideration of the dependency between the consecutive nucleotides over multiple conditions. The application of IDR to HTS analyses, therefore, requires further specific extensions for the more complicated situations. To this end, we developed reactIDR, a novel method for the extraction of true reactivity signals from the replicated HTS data, based on the determination of statistical reliability score according to the IDR and HMM (Fig. 2).
For L as a total length of transcripts and a vector of scaled rankings at the ith position, vi can be given. In reactIDR, a latent variable at the ith position hi can belong to any element of {loop, stem/bg, and unmapped}, corresponding to each status associated with the HTS data and RNA secondary structures. The enrichment of vi is thought to be most likely observed at the regions that belong to the loop class in the case samples and stem/bg class in the control samples, while we did not expect any specific enrichment in the case of unmapped classes.
In reactIDR, the paths of all latent variables can be represented by h={h0, …, hL}, and the likelihoods of V and h can be obtained as the products of emission and transition probabilities, as formulated below:
$$ \begin{aligned} P(V, \mathbf{h}|\mathbf{\theta}) =& P\left(h_{0}|\mathbf{\theta}\right) \prod\limits_{i=1}^{L} P\left(\mathbf{v_i}|h_{i},\mathbf{\theta}\right) \prod\limits_{i=0}^{L-1}P\left(h_{i+1}|h_{i}, \mathbf{\theta}\right) \end{aligned} $$
(2)
where θ consists of the transition matrix between each latent class pair \(\mathcal {R}\), and the set of all of parameters in the copula model for case samples θcase and control samples θcont, respectively (i.e., μ, σ2, and ρ, see Additional file 1: Figures S4 and S5). Due to the difficulties in the mapping of reads into the edge of transcripts, h0 and hL are always assigned to the unmapped class (P(h0|θ)=1 and P(hL|θ)=1). P(hi+1|hi,θ) represents the transition probability between hi and hi+1 in the HMM, represented by \(\mathcal {R}\), which is further optimized by the expectation of the transition between hi and hi+1 at each step of the EM algorithm iteration. P(vi|hi,θ) is an emission probability, which we defined as follows:
$$\begin{aligned} P\left(\mathbf{v}_{i}|h_{i},\mathbf{\theta}\right) =& P\left(\mathbf{v}_{i, case}|r_{{case}}, \mathbf{\theta}_{{case}}\right) \cdot P\left(\mathbf{v}_{i,cont}|r_{{cont}}, \mathbf{\theta}_{{cont}}\right)\\ r_{{cont}} :=& \ rep \text{ if}\ h_{i} =stem/bg \text{ and}~ irep \text{ otherwise} \\ r_{{case}} :=& \ rep \text{ if}\ h_{i} =loop \text{ and}~ irep \text{ otherwise }, \end{aligned} $$
where vcase and vcont are expected to show the enrichment of loop-like and stem-like (or background noise) regions, respectively. These probabilities for rep and irep signals can be obtained by the mixture Gaussian copula, as when using the IDR.
Joint distribution f(2), cumulative marginal distribution Fk(=vi,j,k), and the function of the vectorized inverse of FkF−1 can be defined as follows:
$$\begin{aligned} f_{(2)}\left(\mathbf{x}, \mathbf{\mu}, \sigma^{2}, \rho\right) =& \mathcal{N}\left(\mathbf{x}| \mathbf{\mu}, \sigma^{2}, \rho\right) \\ F_{k}\left(x_{j,k}\right) =& {\int}_{-\infty}^{x_{j,k}} f_{k}\left(x^{\prime}_{k}\right)dx^{\prime}_{k} \,=\, {\int}_{-\infty}^{x_{j,k}} q_{j}\mathcal{N}\left(x^{\prime}_{k}|\mu_{h}, \sigma_{h}^{2}\right)\\ &+\left(1-q_{j}\right)\mathcal{N}\left(x^{\prime}_{k}|\mu_{0},\sigma_{0}^{2}\right) dx^{\prime}_{k} \\ F^{-1}\left(\mathbf{v_{j}}\right) =& \left\{ F^{-1}_{1}\left(v_{i, j, 1}\right), F^{-1}_{2}\left(v_{i, j, 2}\right) \right\}, \end{aligned} $$
where h=1 (2) when j=case (cont), and qj represents the ratio of true signals in j samples. Afterward, the emission probability based on the mixture Gaussian copula consisting of rep and irep distributions with parameters θcase or θcont can be formulated as follows:
$$\begin{aligned} P\left(\mathbf{v}_{i, j}| \ r_{j}, {\boldsymbol{\theta}_{\boldsymbol{j}}}\right) =& \frac{f_{(2)}\left(F^{-1}\left(\mathbf{v}_{i,j}\right), {\boldsymbol{\mu}_{\boldsymbol{h}}}, \sigma_{h}^{2}, \rho_{h}\right)}{ \prod_{k = 1}^{2} f_{k}\left(F^{-1}_{k}\left(v_{i,j,k}\right)\right)} & \text{ if } r_{j} = \text{ rep} \\ =& \frac{f_{(2)}\left(F^{-1}\left(\mathbf{v}_{i,j}\right), \mathbf{\boldsymbol{\mu}_{\boldsymbol{0}}}, \sigma_{0}^{2}, \rho_{0}\right) }{ \prod_{k= 1}^{2} f_{k}\left(F^{-1}_{k}\left(v_{i,j,k}\right)\right)} & \text{ otherwise,} \end{aligned} $$
where μk>0 and 0<ρk≤1 while \(\mu _{0}=0, \ \sigma _{0}^{2} = 1\), and ρ0=0, the same as those in Eq.1 (further details can be found in Additional file 1). In this way, using reactIDR, we can compute a posterior probability distribution for each latent class at each site as an index of the reactivity and evaluate its reliability. Using the EM algorithm, each parameter can be iteratively optimized to maximize the expectation of likelihood Eq. 2 in reactIDR. Furthermore, reactIDR can incorporate supervised learning at this step, by limiting H so that it is consistent with the reference structure. Specifically, loop, stem/bg, and unmapped class is associated with loop, stem, and regions, respectively, in which stem/bg class is expected to play a role in removing false positives from true loop regions in the icSHAPE datasets. Afterward, trained parameters were used as the initial set of parameters in the re-fitting process for each rRNA sequence of the test set. The details of the optimization process, such as the derivative of Q for each parameter, are described in Additional file 1.
Datasets used and the evaluation of classification based on reactivity indicators
In this study, datasets generated by using two HTS methods, icSHAPE and PARS, were used to validate the accuracy of reactIDR. A whole-transcriptome PARS dataset of the native deproteinized transcriptome of GM12878 cells in vitro was obtained (GEO accession number, GSE50676) [20]. Here, we used the normalized read counts of two replicates treated with nucleases S1 and V1 (hereafter referred to as S1 and V1, respectively). Analyses were also conducted using the icSHAPE dataset obtained for the HEK293T cells (GEO accession number, GSE74353), which contains sequencing reads obtained for three conditions with two replicates: dimethyl sulfoxide (DMSO)-, in vitro NAI-N3-, and in vivo NAI-N3-treated cells [21]. Computation of the original PARS and icSHAPE scores and the characteristics of these datasets can be found in Additional file 1: Section 2 and Figures S6 and S7.
To construct the reference set of the rRNA sequences, a human ribosomal repeating unit (NT_167214.1) was extracted from the NCBI database, and a 5S rRNA sequence (ENST00000364451) was additionally included. As a rRNA structure reference, cryo-EM-based ribosomal structure (PDB ID: 4v6x) was aligned to our reference sequence [22]. To obtain base-to-base correspondence between the reference sequences and the structure, 5S, 5.8S, 18S, and 28S rRNA reference sequences were aligned to each sequence within the structure dataset, and all bases successfully aligned to the reference were used for the evaluation of classification. Of the four rRNAs in the cryo-EM ribosomal structure, 18S rRNA reference structure was used as the training set for the supervised learning of reactIDR, with the negative alignment data for the minus strand of each rRNA, representing the read counts produced by misalignment. The accessible surface area (ASA) of the ribosome structure was calculated for each nucleotide using NACCESS with default parameter settings [23].
To evaluate the accuracy of structure classification based on reactivity indicators, we constructed a receiver operating characteristic (ROC) curve. In the ROC curve, the y-axis corresponds to the true positive rates, while the x-axis corresponds to the false positive rate. P-values were computed in order to compare the area under curve (AUROC) of reactIDR and other scoring methods by using the pROC library in R with a bootstrap method with 100 repetitions. We measured the accuracy of in silico structure prediction with or without the structure scores by positive predictive value (PPV) and sensitivity (identical to the true positive rates) for each possible base pair within the transcript except for pseudo-knots. The accuracy of the structure predictions based on multiple HTS datasets was investigated using the support vector machine (SVM) as well, through the scikit-learn library interface. For in silico structure prediction assisted by the structure scores, RNAfold from the Vienna RNA package v2.4.0 [3] was applied.