Differential RNA methylation data analysis includes the following steps: reads alignment, peak calling (methylation site detection), reads counting and differential analysis. The newly developed QNB package deals with the last step (See Fig. 3). Please note that, this is only one example. In practice, if differential methylation analysis is applied to gene or base resolution, only reads count is needed, and peak calling step will not be necessary.

### QNB model

Let *t*
_{
i , j
} and *c*
_{
i , j
} represent the reads counts of the *i*-th feature (gene or RNA methylation site) in the paired IP and input control sample of MeRIP-Seq data from *j*-th biological replicate, respectively. When the sequencing depths of different samples are the same, we may ignore its influence and have

$$ {t}_{i,j}\sim \mathrm{Binomial}\left({p}_{i,\rho (j)},{n}_{i,j}\right) $$

(2)

where
*n*
_{
i , j
} = *t*
_{
i , j
} + *c*
_{
i , j
}
and
*ρ*(*j*) represents the experimental condition (cell type, tissue or treatment) of the
*j*
-th biological replicate, and
*p*
_{
i , ρ(j)}
denotes the percentage of methylation for the
*i*
-th feature in
*j*
-th biological replicate. The goal of differential RNA methylation analysis for a specific feature is to test whether the percentage of methylation remain the same under two different experimental conditions
\( \mathcal{A} \)
and ℬ, i.e., the null hypothesis
\( {p}_{i,\mathcal{A}}={p}_{i,\mathrm{\mathcal{B}}} \)
.

Considering the over-dispersion effect of sequencing reads count data, *t*
_{
i , j
}and *c*
_{
i , j
}are assumed to follow the negative binomial distribution

$$ {t}_{i,j}\sim \mathrm{NB}\left({\mu}_{t,i,j},{\sigma}_{t,i,j}^2\right) $$

(3)

$$ {c}_{i,j}\sim \mathrm{NB}\left({\mu}_{c,i,j},{\sigma}_{c,i,j}^2\right) $$

(4)

where their means can be decomposed by

$$ {\mu}_{t,i,j}={q}_i{p}_{i,\rho (j)}{e}_{i,\rho (j)}{s}_{t,j} $$

(5)

$$ {\mu}_{c,i,j}={q}_i\left(1-{p}_{i,\rho (j)}\right){e}_{i,\rho (j)}{s}_{c,j} $$

(6)

Here, *q*
_{
i
}represents the expected abundance of feature *i* under all conditions in a standard sequencing library. *s*
_{
t , j
}and *s*
_{
c , j
} represent the size factor of the IP and input control sample of the *j*-th biological replicate and directly reflect their sequencing depth. *p*
_{
i , ρ(j)} stands for risk of RNA methylation, or the true percentage of methylation for feature *i* under condition *ρ*(*j*) on the common scale, i.e., without rescaling by the size factors *s*
_{
c , j
} and*s*
_{
t , j
}. Additionally, *e*
_{
i , ρ(j) }is introduced to model differential expression at RNA level as a feature-specific size factor, which indicates the abundance of feature *i* under a specific experimental condition compared with the standard abundance *q*
_{
i
}.

In this model, the sequencing size factor *s*
_{
t , j
} and *s*
_{
c , j
} of the IP and input control sample can be conveniently estimated from the total number of the reads in a library or using the “geometric” approach developed for RNA-Seq data [23, 25]. The other parameters can be estimated as follows:

$$ {\widehat{q}}_i=\underset{\forall j}{\mathbb{E}}\left(\frac{t_{i,j}}{s_{t,j}}+\frac{c_{i,j}}{s_{c,j}}\right) $$

(7)

$$ {\widehat{p}}_{i,\rho (j)}=\sum_{j:\rho (j)=\rho}\left(\frac{t_{i,j}}{s_{t,j}}\right)/\sum_{j:\rho (j)=\rho}\left(\frac{t_{i,j}}{s_{t,j}}+\frac{c_{i,j}}{s_{c,j}}\right) $$

(8)

$$ {\widehat{e}}_{i,\rho }=\frac{1}{\left|\rho \right|{\widehat{q}}_i}\sum_{j:\rho (j)=\rho}\left(\frac{t_{i,j}}{s_{t,j}}+\frac{c_{i,j}}{s_{c,j}}\right) $$

(9)

where |*ρ*| denotes the number of biological replicates under a specific experimental condition
*ρ*
.

Please note that, compared with the DRME model [26], a more robust estimator for background expression level of the feature is implemented Eq. (7) by taking advantage of both the IP and input control samples. In DRME model, the basal level of gene expression is estimated from the input control sample only, as in theory without anti-body based enrichment, the input control sample of MeRIP-Seq data should contain both methylated and unmodified molecules, and thus corresponds to the true expression level. However, since the reads are usually enriched in the IP samples for a methylation sites to be called, there is usually less reads in the input control samples, and thus the estimator is not robust for very lowly expressed genes. For this reason, the basal level is estimated from the sum of input and IP samples in the QNB model. The robust estimator should largely improve the testing performance for very lowly expressed genes.

Inspired by the DESeq formulation [23], the variance in Eqs. **(**
3
**)** and **(**
4
**)** can be further decomposed into the shot noise and raw variance, i.e.,

$$ {\sigma}_{t,i,j}^2=\underset{\mathrm{shot}\ \mathrm{noise}}{\underbrace{\mu_{t,i,j}}}+\underset{\mathrm{raw}\ \mathrm{variance}}{\underbrace{{\left({e}_{i,j}{s}_{t,j}\right)}^2{\upsilon}_{t,i,\rho (j)}}} $$

(10)

$$ {\sigma}_{c,i,j}^2=\underset{\mathrm{shot}\ \mathrm{noise}}{\underbrace{\mu_{c,i,j}}}+\underset{\mathrm{raw}\ \mathrm{variance}}{\underbrace{{\left({e}_{i,j}{s}_{c,j}\right)}^2{\upsilon}_{c,i,\rho (j)}}} $$

(11)

where
*μ*
_{
t , i , j
}
and
*μ*
_{
c , i , j
}
are the variance of a Poisson distribution, which is often used to model technical replicates in NGS data. Additionally, due to biological variability, the over-dispersion of a Poisson model is represented by (*e*
_{
i , ρ(j)}
*s*
_{
t , j
})^{2}
*υ*
_{
t , i , ρ(j)}
and(*e*
_{
i , ρ(j)}
*s*
_{
c , j
})^{2}
*υ*
_{
c , i , ρ(j)}
, where
*e*
_{
i , ρ(j)}
and
*s*
_{
t , j
}
(or
*s*
_{
c , j
}
) quantify the impact of condition-specific gene differential expression and sample-specific library size (or the sequencing depth), respectively. We consider the per-feature raw variance parameter
*υ*
_{
i , ρ
}
is a smooth function of the expected methylation rate
*p*
_{
i , ρ
}
and the feature abundance
*q*
_{
i , ρ
}
under a specific condition
*ρ*
, i.e.,

$$ {\upsilon}_{t,i,\rho (j)}={\upsilon}_{t,\rho}\left({p}_{i,\rho (j)},{q}_{i,\rho (j)}\right) $$

(12)

$$ {\upsilon}_{c,i,\rho (j)}={\upsilon}_{c,\rho}\left({p}_{i,\rho (j)},{q}_{i,\rho (j)}\right) $$

(13)

For methylation reads count *t*
_{
i , j
} in the IP sample, the variances on the common scale \( {\widehat{w}}_{t,i,\rho } \) can be calculated with

$$ {\widehat{w}}_{t,i,\rho }=\frac{1}{\left(\left|\rho \right|-1\right)}\sum_{j:\rho (j)=\rho }{\left[\frac{t_{i,j}}{{\widehat{s}}_{t,j}{\widehat{e}}_{i,\rho (j)}}-{\overline{q}}_{t,i,\rho}\right]}^2 $$

(14)

where

$$ {\overline{q}}_{t,i,\rho }=\frac{1}{\left|\rho \right|}\sum_{j:\rho (j)=\rho}\frac{t_{i,j}}{{\widehat{s}}_{t,j}{\widehat{e}}_{i,\rho (j)}} $$

(15)

Let

$$ {z}_{t,i,\rho }=\frac{{\widehat{q}}_i{\widehat{p}}_{i,\rho (j)}}{\left|\rho \right|}\sum_{j:\rho (j)=\rho}\left(\frac{1}{{\widehat{s}}_{t,j}{\widehat{e}}_{i,\rho (j)}}\right) $$

(16)

Following the methodology of DESeq model [23], we show in the supplementary materials (Additional file 1) that\( \left({\widehat{w}}_{t,i,\rho }-{z}_{t,i,\rho}\right) \) is an unbiased estimator for the raw variance parameter *υ*
_{
t , i , ρ
}, with

$$ {\widehat{\upsilon}}_{t,i,\rho (j)}\left({\widehat{p}}_{i,\rho },{\widehat{q}}_i\right)={w}_{t,i,\rho}\left({\widehat{p}}_{i,\rho },{\widehat{q}}_i\right)-{z}_{t,i,\rho } $$

(17)

as our estimate for the raw variance parameter
*υ*
_{
t , i , ρ(j)}.

We use a 2-dimensional local regression on the graph \( \left({\widehat{p}}_{i,\rho },{\widehat{q}}_i,{\widehat{w}}_{t,i,\rho}\right) \)to obtain a smooth function of\( {w}_{t,i,\rho}\left({\widehat{p}}_{i,\rho },{\widehat{q}}_i\right) \). Since \( {\widehat{w}}_{t,i,\rho } \)in Eq. **(**
14
**)** is the sum of squared random variable, the residuals of the model\( {w}_{t,i,\rho }-{w}_{t,i,\rho}\left({\widehat{p}}_{i,\rho },{\widehat{q}}_{i,\rho}\right) \) are skewed. Following reference [27] and the practice in DESeq [23], we also implemented a generalized linear model of the gamma family for the local regression with the implementation in R locfit package [28] for estimation of \( {w}_{t,i,\rho}\left({\widehat{p}}_{i,\rho },{\widehat{q}}_i\right) \).

Similar to the estimation of *υ*
_{
t , i , ρ(j)} and *w*
_{
t , i , ρ
} in the IP samples as described previously, the raw variance parameter *υ*
_{
c , i , ρ(j)} and the variance of reads on the common scale *w*
_{
c , i , ρ
} for the input control samples can also be estimated.

### Testing & Metrics

For differential RNA methylation analysis, we consider the null hypothesis that condition \( \mathcal{A} \) and condition ℬ are of the same methylation rate on the common scale, i.e.,\( {p}_{i,\mathcal{A}}={p}_{i,\mathrm{\mathcal{B}}}={p}_{i,\mathcal{O}} \), which can be estimated with

$$ {\widehat{p}}_{i,\mathcal{O}}=\sum_{j\in \mathcal{A}\cup \mathrm{\mathcal{B}}}\frac{t_{i,j}}{s_{t,j}}/\sum_{j\in \mathcal{A}\cup \mathrm{\mathcal{B}}}\left(\frac{t_{i,j}}{s_{t,j}}+\frac{c_{i,j}}{s_{c,j}}\right) $$

(18)

For each feature *i* and replicate *j *of its condition *ρ*(*j*), the reads counts *t*
_{
i , j
}and *c*
_{
i , j
}are considered independently distributed. For differential methylation analysis between condition \( \mathcal{A} \) and ℬ, we construct 4 random variables following negative binomial distributions for the IP and input control samples under two experimental conditions, respectively, i.e.,

$$ {t}_{i,\mathcal{A}}=\sum_{j\in \mathcal{A}}\left({t}_{i,j}\right)\sim \mathrm{NB}\left({\widehat{\mu}}_{t,i,\mathcal{A}},{\widehat{\sigma}}_{t,i,\mathcal{A}}^2\right) $$

(19)

$$ {t}_{i,\mathrm{\mathcal{B}}}=\sum_{j\in \mathrm{\mathcal{B}}}\left({t}_{i,j}\right)\sim \mathrm{NB}\left({\widehat{\mu}}_{t,i,\mathrm{\mathcal{B}}},{\widehat{\sigma}}_{t,i,\mathrm{\mathcal{B}}}^2\right) $$

(20)

$$ {c}_{i,\mathcal{A}}=\sum_{j\in \mathcal{A}}\left({c}_{i,j}\right)\sim \mathrm{NB}\left({\widehat{\mu}}_{c,i,\mathcal{A}},{\widehat{\sigma}}_{c,i,\mathcal{A}}^2\right) $$

(21)

$$ {c}_{i,\mathrm{\mathcal{B}}}=\sum_{j\in \mathrm{\mathcal{B}}}\left({c}_{i,j}\right)\sim \mathrm{NB}\left({\widehat{\mu}}_{c,i,\mathrm{\mathcal{B}}},{\widehat{\sigma}}_{c,i,\mathrm{\mathcal{B}}}^2\right) $$

(22)

It is not difficult to calculate the distribution parameters in Eqs. (19), (20), (21) and (20). Taking \( {t}_{i,\mathcal{A}} \)for example, we have

$$ {\widehat{\mu}}_{t,i,\mathcal{A}}={\widehat{p}}_{i,\mathcal{O}}{\widehat{q}}_i{\widehat{e}}_{i,\mathcal{A}}\sum_{j\in \mathcal{A}}{\widehat{s}}_{t,j} $$

(23)

$$ {\widehat{\sigma}}_{t,i,\mathcal{A}}^2={\widehat{p}}_{i,\mathcal{O}}{\widehat{q}}_i{\widehat{e}}_{i,\mathcal{A}}\sum_{j\in \mathcal{A}}{s}_{t,j}+{\upsilon}_{\mathcal{A}}\left({\widehat{p}}_{i,\mathcal{O}},{\widehat{q}}_i\right){\widehat{e}}_{i,\mathcal{A}}^2\sum_{j\in \mathcal{A}}{\widehat{s}}_{t,j}^2 $$

(24)

Given the total number of methylation read count \( \left({t}_i={t}_{i,\mathcal{A}}+{t}_{i,\mathrm{\mathcal{B}}}\right) \) and the total number of reads under each condition \( \left({n}_{i,\mathcal{A}}={t}_{i,\mathcal{A}}+{c}_{i,\mathcal{A}}\right) \) and (*n*
_{
i , ℬ} = *t*
_{
i , ℬ} + *c*
_{
i , ℬ}) do not change, the joint conditional probability of the observation \( \left({t}_{i,\mathcal{A}}=t\right) \) can be calculated with

$$ {\displaystyle \begin{array}{c}P\left({t}_{i,\mathcal{A}}=t|{t}_i,{n}_{i,\mathcal{A}},{n}_{i,\mathrm{\mathcal{B}}}\right)=P\left({t}_{i,\mathcal{A}}=t\right)P\left({t}_{i,\mathrm{\mathcal{B}}}={t}_i-t\right)\\ {}P\left({c}_{i,\mathcal{A}}={n}_{i,\mathcal{A}}-t\right)P\left({c}_{i,\mathrm{\mathcal{B}}}={n}_{i,\mathrm{\mathcal{B}}}-{t}_i+t\right)\end{array}} $$

(25)

whose components are previously defined in Eqs. (19), (20), (21) and (22).

Please note that, the over-dispersion of reads counts in input control samples are also modeled and covered in the QNB test, making it substantially different from the DESeq, DRME or ChIPComp. The QNB test essentially covers all the 4 samples with 4 cross-linked binomial distributions; while in DRME model, the input control samples are used only for gene expression estimation, so the statistical test covers the IP samples only with 2 negative binomial distributions. The inclusion of input control samples in the test, rather than simply using it as a background, makes a major contribution to the performance improvement, and also makes QNB substantially different from all other count-based (negative-binomial distribution-based) approaches such as DRME, edgeR, DESeq and ChIPComp.

The statistical significance of an observation can then be calculated using a two-sided test

$$ p\hbox{-} \mathrm{value}=\frac{\sum_{t:P(t)\le P\left({t}_{i,\mathcal{A}}\right)}P(t)}{\sum_{\forall t}P(t)} $$

(26)

Besides the *p*-value that quantifies the statistical significance, the risk ratio (RR) of RNA methylation level, which quantifies the degree of differential methylation, can also be calculated based on Eq. (8), with

$$ {\mathrm{RR}}_i={\widehat{p}}_{i,\mathcal{A}}/{\widehat{p}}_{i,\mathrm{\mathcal{B}}} $$

(27)

where conditionℬ is considered as the control group in a case-control study and \( \mathcal{A} \)
as the treated group. Please note that, the percentage of methylation under an experimental condition
\( {p}_{i,\mathcal{A}} \)
denotes a normalized degree of methylation observed on the data rather than the true percentage of methylation in biological sense. However, it still provides a good evaluation of the relative methylation level. Similar to the methylation risk ratio (RR), the odds ratio (OR) of RNA methylation, which also quantifies the degree of differential RNA methylation, can be calculated after compensating the sample sequencing depth

$$ {\mathrm{OR}}_i=\left\{\frac{\sum_{j\in \mathcal{A}}\left({t}_{i,j}/{s}_{t,j}\right)}{\sum_{j\in \mathcal{A}}\left({c}_{i,j}/{s}_{c,j}\right)}\right\}/\left\{\frac{\sum_{j\in \mathrm{\mathcal{B}}}\left({t}_{i,j}/{s}_{t,j}\right)}{\sum_{j\in \mathrm{\mathcal{B}}}\left({c}_{i,j}/{s}_{c,j}\right)}\right\} $$

(28)