Quantifying homeolog expression bias (HEB)
We will write A and B to denote a homeologous gene pair from which RNA-seq data is generated in n biological replicates. Typically, the mean expression levels of the homeologs (denoted \(\bar {a}\) and \(\bar {b}\)) are normalized by gene length and sequencing depth, as when reported in units of RPKM (reads per kilobase of coding sequence per million mapped reads). We define the homeolog expression bias (HEB) of the n replicates as
$$\text{HEB} = \log (\bar{b} / \bar{a}) = \log \bar{b} - \log \bar{a} \,, $$
a dimensionless quantity with HEB=0 indicating no bias. If one uses the base 2 logarithm, HEB=−3 indicates 8-fold bias towards homeolog A.
Likelihood ratio test for HEB
After accounting for the possibility of different gene lengths, the statistical test for HEB is essentially a likelihood ratio test for differential expression of a pair of homeologous genes. The goal is to determine whether there is sufficient evidence to reject the null hypothesis (H0) that there is no bias (i.e., equal expression levels for homeologous genes) in favor of the alternative hypothesis (H1) that bias is present, i.e., different expression levels for homeologous genes. In mathematical terms, the null hypothesis H0 corresponds to the parameters (denoted by θ) of a probability model for generating the data being in a specified subset Θ0 of the parameter space Θ, that is,
$$\begin{array}{@{}rcl@{}} && H_{0} \colon \; \theta \in \Theta_{0} \\ && H_{1} \colon \; \theta \in \Theta \backslash \Theta_{0} \,. \end{array} $$
Let θ=(λa,λb) denote the true but unknown expression levels (physical units of length −1, e.g., RPKM). Assuming positive, i.e. non-zero, expression, the parameter space is \(\Theta = \left \{\theta : {{\lambda }^{a}},{{\lambda }^{b}} \in \mathbb {R}_{+} \right \}\). The null (H0) and alternative (H1) hypotheses for the likelihood ratio test for homeolog expression bias are formalized as follows,
$$\begin{array}{@{}rcl@{}} && H_{0} \colon \; \left({{\lambda}^{a}}, {{\lambda}^{b}}\right) \in \left\{{{\lambda}^{a}},{{\lambda}^{b}} \in \mathbb{R}_{+} : {{\lambda}^{a}} = {{\lambda}^{b}}\right\} \\ && H_{1} \colon \; \left({{\lambda}^{a}}, {{\lambda}^{b}}\right) \in \left\{{{\lambda}^{a}},{{\lambda}^{b}} \in \mathbb{R}_{+} : {{\lambda}^{a}} \neq {{\lambda}^{b}} \right\} \,. \end{array} $$
Equivalently, let ω=λb/λa denote the ratio of expression levels and drop the superscript indicating the reference homeolog (λ=λa). In that case, λb=ωλ and the hypotheses are written as follows,
$$\begin{array}{@{}rcl@{}} H_{0} \colon && ({\lambda}, {\omega}) \in \{ {\lambda}, {\omega} \in \mathbb{R}_{+} : {\omega} = 1 \} \\ H_{1} \colon && ({\lambda}, {\omega}) \in \{ {\lambda}, {\omega} \in \mathbb{R}_{+} : {\omega} \neq 1 \} \,. \end{array} $$
Once we specify a probability model for the data \(\mathcal {X}\), likelihood functions for each hypothesis, \(\mathcal {L}_{0}(\theta | \mathcal {X})\) and \(\mathcal {L}_{1}(\theta | \mathcal {X})\), can be derived (see next section). For composite hypotheses, the appropriate likelihood ratio test statistic is
$$ W(\mathcal{X})= -2 \ln \frac{ \hat{\mathcal{L}}_{0}}{ \hat{\mathcal{L}}_{1}} = 2 \left(\ln \hat{\mathcal{L}}_{1} - \ln \hat{\mathcal{L}}_{0} \right) \,, $$
(1)
where \(\hat {\mathcal {L}}_{1} \) and \(\hat {\mathcal {L}}_{0}\) are the maximized likelihoods,
$$\begin{array}{@{}rcl@{}} \hat{\mathcal{L}}_{1} &= & \sup\{\,\mathcal{L}(\theta | \mathcal{X}) : \theta\in\Theta\,\} \\ \hat{\mathcal{L}}_{0} &=& \sup\{\mathcal{L}(\theta | \mathcal{X}):\theta\in\Theta_{0}\,\} \,. \end{array} $$
A critical value of the test statistic (W∗) is obtained from the Chi-squared distribution with significance level α=0.05. The number of degrees of freedom δ is the difference in the number of free parameters in Θ and Θ0 (here δ=1) [36]. The null hypothesis H0 is rejected in favor of the alternative H1 when \(W(\mathcal {X}) > W_{*}\).
Probability model for RNA-seq read counts
Denote the lengths of homeologous genes a and b as ℓa and ℓb (e.g., in kilobases) and let di be the sequencing depth (e.g., in millions of mapped reads) of replicate i. The expected number of RNA-seq reads for gene a and replicate i is
$$ {{\mu}^{a}_{i}} = {{\lambda}^{a}} {{\ell}^{a}} {{d}_{i}} = {\lambda} {{\ell}^{a}} {{d}_{i}} \,, $$
(2)
where in the second equality we have dropped the superscript for the reference homeolog (λ=λa). Similarly, the expected number of RNA-seq reads for gene b and replicate i is
$$ {{\mu}^{b}_{i}} = {{\lambda}^{b}} {{\ell}^{b}} {{d}_{i}} = {\omega} {\lambda} {{\ell}^{b}} {{d}_{i}} \, $$
(3)
where ω=λb/λa=λb/λ.
In order to model the overdispersion commonly observed in RNA-seq data, the probability model assumes that the count data for each gene is drawn from a negative binomial distribution,
$$ f(x ; {\mu},r) = \frac{\Gamma(r+x)}{\Gamma(r)x!} \left(\frac{{\mu}}{{\mu}+r}\right)^{x} \left(\frac{r}{{\mu}+r}\right)^{r}\,, $$
where μ is the appropriate mean (\({{\mu }^{a}_{i}}\) or \({{\mu }^{b}_{i}}\) in Eqs. 2 and 3). That is, if \(X^{a}_{i}\) and \(X^{b}_{i}\) are random variables representing the count data for replicate i of homeologous genes A and B,
$$\begin{array}{@{}rcl@{}} \Pr\left\{X^{a}_{i} = {{a}_{i}}\right\} &=& f\left({{a}_{i}} ; {\lambda} {{\ell}^{a}} {{d}_{i}},{{r}_{i}}\right) \\ \Pr\left\{X^{b}_{i} = {{b}_{i}}\right\} &=& f\left({{b}_{i}} ; {\omega} {\lambda} {{\ell}^{b}} {{d}_{i}},{{r}_{i}}\right) \,, \end{array} $$
where we have used \({{\mu }^{a}_{i}} = {\lambda } {{\ell }^{a}} {{d}_{i}}\) and \({{\mu }^{b}_{i}} = {\omega } {\lambda } {{\ell }^{b}} {{d}_{i}}\). In these expressions, the aggregation parameter ri is obtained from the observed mean-variance relation for all homeolog pairs of the ith experimental replicate (see Appendix 1).
Assuming independence of experimental replicates, the likelihood functions \(\mathcal {L}_{1}\) and \(\mathcal {L}_{0}\) are products of the likelihood functions for each observation, that is,
$$ \mathcal{L}_{1}(\mathcal{X}) = \textstyle \prod_{i=1}^{n} \mathcal{L}_{1}^{i} (\mathcal{X}) \,, $$
and similarly for \(\mathcal {L}_{0}(\mathcal {X})\), where \(\mathcal {X}_{i} = \{{{a}_{i}},{{b}_{i}}\}\) indicates the observed read counts for replicate i and \(\mathcal {X} = \cup _{i=1}^{n} \mathcal {X}_{i}\). The likelihood function for the alternative hypothesis and the ith replicate is
$$\begin{array}{@{}rcl@{}} && \mathcal{L}_{1}^{i}(\mathcal{X})= \frac{\Gamma({{r}_{i}}+{{a}_{i}})}{\Gamma({{r}_{i}}){{a}_{i}}!}\frac{\Gamma({{r}_{i}}+{{b}_{i}})}{\Gamma({{r}_{i}}){{b}_{i}}!} \\ &&\times \left(\frac{{\lambda} {{\ell}^{a}} {{d}_{i}}}{{\lambda} {{\ell}^{a}} {{d}_{i}}+{{r}_{i}}}\right)^{{{a}_{i}}} \left(\frac{{\omega} {\lambda} {{\ell}^{b}} {{d}_{i}}}{{\omega} {\lambda} {{\ell}^{b}} {{d}_{i}}+{{r}_{i}}}\right)^{{{b}_{i}}} \\ && \times \left(\frac{{{r}_{i}}}{{\lambda} {{\ell}^{a}} {{d}_{i}} +{{r}_{i}}}\right)^{{{r}_{i}}} \left(\frac{{{r}_{i}}}{{\omega} {\lambda} {{\ell}^{b}} {{d}_{i}} + {{r}_{i}}}\right)^{{{r}_{i}}} \, . \end{array} $$
(4)
The likelihood function for the null hypothesis and the ith replicate, \(\mathcal {L}_{0}^{i}(\mathcal {X})\), is given by Eq. 4 with ω=1.
Maximum likelihood estimation
Maximum likelihood estimation is performed using the the log-likelihood function corresponding to Eq. 4, namely,
$$ {\ln \mathcal{L}}_{1}(\mathcal{X}) = \textstyle \sum_{i} {\ln \mathcal{L}}_{1}^{i} (\mathcal{X}) \,, $$
(5)
where
$$\begin{array}{@{}rcl@{}} {\ln \mathcal{L}}_{1}^{i}(\mathcal{X}) &=& \gamma({{r}_{i}}+{{a}_{i}}) + \ln\left({{a}_{i}}!\right) \\ &+&\gamma\left({{r}_{i}}+{{b}_{i}}\right) + \ln\left({{b}_{i}}!\right) \\ &+& 2 {{r}_{i}} \ln {{r}_{i}} - 2 \gamma\left({{r}_{i}}\right) \\ &+& {{a}_{i}} \ln\left({\lambda} {{\ell}^{a}} {{d}_{i}} \right) +{{b}_{i}} \ln\left({\omega} {\lambda} {{\ell}^{b}} {{d}_{i}} \right) \\ & -& \left({{a}_{i}} + {{r}_{i}} \right) \ln \left({\lambda} {{\ell}^{a}} {{d}_{i}} + {{r}_{i}}\right) \\ & -& \left. \left({{b}_{i}} + {{r}_{i}} \right) \ln \left({\omega} {\lambda} {{\ell}^{b}} {{d}_{i}} + {{r}_{i}}\right) \right] \end{array} $$
(6)
and γ(·)= lnΓ(·). The log-likelihood function for the null hypothesis (\({\ln \mathcal {L}}_{0}\)) is given by Eq. 6 with ω=1.
The log-likelihood function \({\ln \mathcal {L}}_{1}(\mathcal {X})\) is maximized by numerically solving for \(\hat {\lambda }\) and \(\hat {\omega }\) leading to zero partial derivatives,
$$\begin{array}{@{}rcl@{}} 0 &=& \left. \frac{{\partial \ln \mathcal{L}}_{1}}{\partial \lambda} \right|_{\hat{\lambda},\hat{\omega}} \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} 0 &=& \left.\frac{{\partial \ln \mathcal{L}}_{1}}{\partial \omega}\right|_{\hat{\lambda},\hat{\omega}} \,, \end{array} $$
(8)
as described in Appendix 2. The log-likelihood function \({\ln \mathcal {L}}_{0}(\mathcal {X})\) is maximized by solving for \(\hat {\lambda }\) leading to
$$ 0 = \left. \frac{{\partial \ln \mathcal{L}}_{0}}{\partial \lambda} \right|_{\hat{\lambda}} \,. $$
(9)
The optimal parameter values \(\hat {\lambda }\) and \(\hat {\omega }\) are used to evaluate \(\ln \hat {\mathcal {L}}_{0} (\mathcal {X} ; \hat {\lambda })\), \(\ln \hat {\mathcal {L}}_{1}(\mathcal {X} ; \hat {\lambda },\hat {\omega })\), and the test statistic W (see Eq. 1).
Quantifying changes in homeolog expression bias (ΔHEB)
Let A and B represent homeologous genes and RNA-seq data is generated under conditions 1 and 2 in n biological replicates, leading to mean expression levels \({\bar {a}_{1}}, {\bar {a}_{2}}, {\bar {b}_{1}}, {\bar {b}_{2}}\). The change in homeolog expression bias (ΔHEB) is defined as
$$ {\Delta {\text{HEB}}} = \text{HEB}_{2} - \text{HEB}_{1} = \log \left(\frac{{\bar{b}_{2}}/{\bar{a}_{2}}}{{\bar{b}_{1}} / {\bar{a}_{1}}}\right) \,, $$
(10)
where the last equality uses \(\text {HEB}_{1} = \log {\bar {b}_{1}}/{\bar {a}_{1}}\) and \(\text {HEB}_{2}= \log {\bar {b}_{2}}/{\bar {a}_{2}}\).
Likelihood ratio test for ΔHEB
The likelihood ratio test for ΔHEB is designed to determine whether there is sufficient evidence to reject the null hypothesis (H0) that homeolog expression bias is the same under two experimental conditions (ΔHEB=0) in favor of the alternative hypothesis (H1) that there is a difference in bias (ΔHEB≠0). Following notation similar to the previous section, our hypotheses are
$$\begin{array}{@{}rcl@{}} && H_{0} \colon \theta \in \Theta_{0} = \left\{\lambda^{a|b}_{1|2} \in \mathbb{R}_{+} : {{\lambda}^{b}_{1}}/{{\lambda}^{a}_{1}} = {{\lambda}^{b}_{2}}/{{\lambda}^{a}_{2}}\right\} \\ && H_{1} \colon \theta \in \Theta \backslash \Theta_{0} =\left\{\lambda^{a|b}_{1|2} \in \mathbb{R}_{+} : {{\lambda}^{b}_{1}}/{{\lambda}^{a}_{1}} \neq {{\lambda}^{b}_{2}}/{{\lambda}^{a}_{2}}\right\} \,, \end{array} $$
where \(\lambda ^{a|b}_{1|2}\) is an abbreviation for \({{\lambda }^{a}_{1}},{{\lambda }^{b}_{1}}, {{\lambda }^{b}_{1}},{{\lambda }^{b}_{2}}\). Equivalently,
$$\begin{array}{@{}rcl@{}} H_{0} \colon & \theta \in \Theta_{0} = \{ {{\lambda}_{1|2}}, {{\omega}_{1|2}} \in \mathbb{R}_{+} : {{\omega}_{1}} = {{\omega}_{2}} \} \\ H_{1} \colon & \theta \in \Theta\backslash \Theta_{0} = \{ {{\lambda}_{1|2}}, {{\omega}_{1|2}} \in \mathbb{R}_{+} : {{\omega}_{1}} \ne {{\omega}_{2}} \} \,, \end{array} $$
where \({{\omega }_{1}} = {{\lambda }^{b}_{1}}/{{\lambda }^{a}_{1}}, {{\omega }_{2}} = {{\lambda }^{b}_{2}}/{{\lambda }^{a}_{2}}, {{\lambda }_{1}} = {{\lambda }^{a}_{1}}\) and \({{\lambda }_{2}} = {{\lambda }^{a}_{2}}\). The difference in degrees of freedom of the alternative and null hypotheses is δ=4−3=1.
The likelihood functions for the ΔHEB test are similar to those for HEB, though the two different experimental conditions lead to twice as many terms (cf. Eq. 4). The likelihood function for H1 is
$$ \mathcal{L}_{1}(\mathcal{X})= \prod_{k=1}^{2} \prod_{i=1}^{n} \mathcal{L}_{1}^{k,i}(\mathcal{X}) $$
(11)
where \(\mathcal {L}_{1}^{k,i}\), the likelihood function for the ith replicate of the kth condition, has the form of Eq. 4 with parameters indexed by condition \(\left (a_{k,i}, b_{k,i}, r_{k,i}^{a}, r_{k,i}^{b}, {\omega }_{k}\right)\). The log-likelihood function for H1 is thus
$$ {\ln \mathcal{L}}_{1}(\mathcal{X}) = \sum_{k=1}^{2} \sum_{i=1}^{n} {\ln \mathcal{L}}_{1}^{k,i} (\mathcal{X}) $$
(12)
where
$$\begin{array}{@{}rcl@{}} {\ln \mathcal{L}}_{1}^{k,i}(\mathcal{X}) &=& \gamma\left({{r}_{k,i}}+{{a}_{k,i}}\right) + \ln\left({{a}_{k,i}}!\right) \\ &+&\gamma\left({{r}_{k,i}}+{{b}_{k,i}}\right) + \ln\left({{b}_{k,i}}!\right) \\ &+& 2 {{r}_{k,i}} \ln {{r}_{k,i}} - 2 \gamma\left({{r}_{k,i}}\right) \\ &+& {{a}_{k,i}} \ln\left({\lambda} {{\ell}^{a}} {{d}_{i}} \right) +{{b}_{k,i}} \ln\left({\omega}_{k} {\lambda} {{\ell}^{b}} {{d}_{i}} \right) \\ & -& \left({{a}_{k,i}} + {{r}_{k,i}} \right) \ln \left({\lambda} {{\ell}^{a}} {{d}_{i}} + {{r}_{k,i}}\right) \\ & -& \left. \left({{b}_{k,i}} + {{r}_{i}} \right) \ln \left({\omega}_{k} {\lambda} {{\ell}^{b}} {{d}_{i}} + {{r}_{k,i}}\right) \right] \end{array} $$
(13)
and γ(·)= lnΓ(·). The log-likelihood function for the null hypothesis (\({\ln \mathcal {L}}_{0}\)) is given by the above expressions with ω1=ω2=ω. The aggregation parameters (rk,i) are determined from the data with experimental conditions k=1 and 2 considered separately (cf. Eqs. 17–19).
The log-likelihood function \({\ln \mathcal {L}}_{1}(\mathcal {X})\) used in the analysis of ΔHEB is maximized by numerically solving uncoupled systems of the form of Eqs. 7 and 8 for \(\left (\hat {\lambda }_{1},\hat {\omega }_{1}\right)\) and \(\left (\hat {\lambda }_{2},\hat {\omega }_{2}\right)\). The log-likelihood function \({\ln \mathcal {L}}_{0}(\mathcal {X})\) is maximized by solving for \(\hat {\lambda }_{1}, \hat {\lambda }_{2}\) and \(\hat {\omega }\) that lead to zero partial derivatives,
$$\begin{array}{@{}rcl@{}} 0 &=& \left. \frac{{\partial \ln \mathcal{L}}_{0}}{\partial {{\lambda}_{1}} } \right|_{\hat{\lambda}_{1},\hat{\lambda}_{2},\hat{\omega}} \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} 0 &=& \left. \frac{{\partial \ln \mathcal{L}}_{0}}{\partial {{\lambda}_{2}}} \right|_{\hat{\lambda}_{1},\hat{\lambda}_{2},\hat{\omega}} \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} 0 &=& \left. \frac{\partial \ln \mathcal{L}_{0}}{\partial{\omega}} \right|_{\hat{\lambda}_{1},\hat{\lambda}_{2},\hat{\omega}} \,. \end{array} $$
(16)
The optimal parameter values are used to evaluate the likelihoods, \(\hat {\mathcal {L}}_{0} (\mathcal {X} ; \hat {\lambda }_{1},\hat {\lambda }_{2},\hat {\omega })\) and \(\hat {\mathcal {L}}_{1}(\mathcal {X} ; \hat {\lambda }_{1},\hat {\lambda }_{2},\hat {\omega }_{1}, \hat {\omega }_{2})\), and the test statistic W (see Eq. 1).
The numerical solution of these equations was facilitated by transforming these equations in a manner that ensured both parameters are positive and symmetric with respect to the mean expression levels of homeolog A and B (see Appendix 2).