 Methodology article
 Open Access
 Published:
A likelihood ratio test for changes in homeolog expression bias
BMC Bioinformatics volume 20, Article number: 149 (2019)
Abstract
Background
Gene duplications are a major source of raw material for evolution and a likely contributor to the diversity of life on earth. Duplicate genes (i.e., homeologs, in the case of a whole genome duplication) may retain their ancestral function, sub or neofunctionalize, or be lost entirely. A primary way that duplicate genes evolve new functions is by altering their expression patterns. Comparing the expression patterns of duplicate genes gives clues as to whether any of these evolutionary processes have occurred.
Results
We develop a likelihood ratio test for the analysis of the expression ratios of duplicate genes across two conditions (e.g., tissues). We demonstrate an application of this test by comparing homeolog expression patterns of 1448 homeologous gene pairs using RNAseq data generated from leaves and petals of an allotetraploid monkeyflower (Mimulus luteus). We assess the sensitivity of this test to different levels of homeolog expression bias and compare the method to several alternatives.
Conclusions
The likelihood ratio test derived here is a direct, transparent, and easily implemented method for detecting changes in homeolog expression bias that outperforms alternative approaches. While our method was derived with homeolog analysis in mind, this method can be used to analyze changes in the ratio of expression levels between any two genes in any two conditions.
Background
Gene duplications are a major source of raw material for evolution and a likely contributor to the diversity of life on earth [1–9]. Gene duplications are a special type of mutation resulting in the multiplication of intact functional components. These duplicate genes may either retain the ancestral function or individual portions of the gene’s ancestral function may be partitioned (i.e., subfunctionalize) or evolve new functions entirely (i.e., neofunctionalize) [10–12]. Duplicate genes may evolve new functions either by changes in the primary coding sequence or altering where and when they are expressed. Previous work has indicated that changes to gene expression and their regulatory networks may be more important, rapid, or flexible than divergence of protein identities in the evolution of sub and neofunctionlization [13–19].
There are multiple scenarios in which genes can be duplicated, ranging from small regional gene duplications to massive whole genome duplications (WGDs). The term polyploid refers to cells or organisms that have undergone a WGD event and contain more than two paired sets of chromosomes. Each complete set of chromosomes is referred to as a subgenome. Homologous genes located on separate subgenomes are referred to as homeologs.
WGDs are especially common in plants; indeed, all extant angiosperms (i.e., flowering plants) have at least two rounds of WGD in common [20], and up to 15% of speciation events in angiosperms may have been the product of WGDs [21]. Importantly, many major crops (corn, potato, wheat, etc.) are polyploid [22]. WGD events and the resulting polyploidy are not restricted to plants, but have occurred in both vertebrate and invertebrate lineages as well. For example, the African clawed frog, Xenopus, commonly used as an experimental model system and extensively studied in developmental biology, includes species ranging from diploid to dodecaploid [23]. Other examples of polyploids with ancient WGD events include the zebrafish Danio rerio [24], several salmonids [2], and some species of fungi [25]. Interestingly, there exists at least one polyploid mammal [26], a tetraploid rat from Argentina that mediates gene dosage by regulation of ribosomal RNA.
The biological consequences of gene duplications and subfunctionalization are significant and include examples such as the evolution of eyes [27], the evolution of hemoglobins [28], development of heat resistance in plants [29], and insecticide resistance [30]. Given the importance of duplicate genes in evolution, it is natural to ask how we might quantify differences in the activity or function of homeologous genes. One way to begin exploring this question is by analyzing gene expression levels.
Genomewide gene expression levels are commonly quantified using high throughput RNA sequencing (RNAseq) [31]. In RNAseq experiments, mRNA is extracted, purified, and reverse transcribed into cDNA. This cDNA is fragmented into smaller pieces and sequenced using nextgeneration technology. The resulting millions of sequence reads are then mapped to either a reference genome or reference transcriptome, and the number of sequences mapping to a particular gene is used as an indication of the expression level of that gene.
In differential expression analysis, highthroughput RNAseq data is used to determine if gene expression levels vary under different experimental conditions, or in distinct tissues, etc. Several different approaches to this statistical analysis exist [32–34], some of which use methods based on maximum likelihood estimation and likelihood ratio tests.
Homeologous gene pairs frequently have distinguishing sequence differences. Therefore, sequencing reads derived from individual homeologs can be distinguished and expression levels can be determined for each homeolog. The term homeolog expression bias (HEB) refers to cases where homeologs are expressed at unequal levels in a single experimental condition [35]. The primary objective of this paper, development of a likelihood ratio test for statistical analysis of changes in homeolog expression bias (denoted ΔHEB) is a nontrivial extension of the statistical analysis of differential expression.
The following sections begin with the derivation of a likelihood ratio test for HEB. This is our starting point for the development of a likelihood ratio test for ΔHEB, i.e. changes in relative expression levels between homeologous genes in two conditions. We apply this method to RNAseq data of homeologous gene expression in petals and leaves of the allotetraploid Mimulus luteus. Finally, using simulated data, we show that the likelihood ratio test for ΔHEB derived here is the best choice among several alternative methods.
Methods
Quantifying homeolog expression bias (HEB)
We will write A and B to denote a homeologous gene pair from which RNAseq data is generated in n biological replicates. Typically, the mean expression levels of the homeologs (denoted \(\bar {a}\) and \(\bar {b}\)) are normalized by gene length and sequencing depth, as when reported in units of RPKM (reads per kilobase of coding sequence per million mapped reads). We define the homeolog expression bias (HEB) of the n replicates as
a dimensionless quantity with HEB=0 indicating no bias. If one uses the base 2 logarithm, HEB=−3 indicates 8fold bias towards homeolog A.
Likelihood ratio test for HEB
After accounting for the possibility of different gene lengths, the statistical test for HEB is essentially a likelihood ratio test for differential expression of a pair of homeologous genes. The goal is to determine whether there is sufficient evidence to reject the null hypothesis (H_{0}) that there is no bias (i.e., equal expression levels for homeologous genes) in favor of the alternative hypothesis (H_{1}) that bias is present, i.e., different expression levels for homeologous genes. In mathematical terms, the null hypothesis H_{0} corresponds to the parameters (denoted by θ) of a probability model for generating the data being in a specified subset Θ_{0} of the parameter space Θ, that is,
Let θ=(λ^{a},λ^{b}) denote the true but unknown expression levels (physical units of length ^{−1}, e.g., RPKM). Assuming positive, i.e. nonzero, expression, the parameter space is \(\Theta = \left \{\theta : {{\lambda }^{a}},{{\lambda }^{b}} \in \mathbb {R}_{+} \right \}\). The null (H_{0}) and alternative (H_{1}) hypotheses for the likelihood ratio test for homeolog expression bias are formalized as follows,
Equivalently, let ω=λ^{b}/λ^{a} denote the ratio of expression levels and drop the superscript indicating the reference homeolog (λ=λ^{a}). In that case, λ^{b}=ωλ and the hypotheses are written as follows,
Once we specify a probability model for the data \(\mathcal {X}\), likelihood functions for each hypothesis, \(\mathcal {L}_{0}(\theta  \mathcal {X})\) and \(\mathcal {L}_{1}(\theta  \mathcal {X})\), can be derived (see next section). For composite hypotheses, the appropriate likelihood ratio test statistic is
where \(\hat {\mathcal {L}}_{1} \) and \(\hat {\mathcal {L}}_{0}\) are the maximized likelihoods,
A critical value of the test statistic (W_{∗}) is obtained from the Chisquared distribution with significance level α=0.05. The number of degrees of freedom δ is the difference in the number of free parameters in Θ and Θ_{0} (here δ=1) [36]. The null hypothesis H_{0} is rejected in favor of the alternative H_{1} when \(W(\mathcal {X}) > W_{*}\).
Probability model for RNAseq read counts
Denote the lengths of homeologous genes a and b as ℓ^{a} and ℓ^{b} (e.g., in kilobases) and let d_{i} be the sequencing depth (e.g., in millions of mapped reads) of replicate i. The expected number of RNAseq reads for gene a and replicate i is
where in the second equality we have dropped the superscript for the reference homeolog (λ=λ^{a}). Similarly, the expected number of RNAseq reads for gene b and replicate i is
where ω=λ^{b}/λ^{a}=λ^{b}/λ.
In order to model the overdispersion commonly observed in RNAseq data, the probability model assumes that the count data for each gene is drawn from a negative binomial distribution,
where μ is the appropriate mean (\({{\mu }^{a}_{i}}\) or \({{\mu }^{b}_{i}}\) in Eqs. 2 and 3). That is, if \(X^{a}_{i}\) and \(X^{b}_{i}\) are random variables representing the count data for replicate i of homeologous genes A and B,
where we have used \({{\mu }^{a}_{i}} = {\lambda } {{\ell }^{a}} {{d}_{i}}\) and \({{\mu }^{b}_{i}} = {\omega } {\lambda } {{\ell }^{b}} {{d}_{i}}\). In these expressions, the aggregation parameter r_{i} is obtained from the observed meanvariance relation for all homeolog pairs of the ith experimental replicate (see Appendix 1).
Assuming independence of experimental replicates, the likelihood functions \(\mathcal {L}_{1}\) and \(\mathcal {L}_{0}\) are products of the likelihood functions for each observation, that is,
and similarly for \(\mathcal {L}_{0}(\mathcal {X})\), where \(\mathcal {X}_{i} = \{{{a}_{i}},{{b}_{i}}\}\) indicates the observed read counts for replicate i and \(\mathcal {X} = \cup _{i=1}^{n} \mathcal {X}_{i}\). The likelihood function for the alternative hypothesis and the ith replicate is
The likelihood function for the null hypothesis and the ith replicate, \(\mathcal {L}_{0}^{i}(\mathcal {X})\), is given by Eq. 4 with ω=1.
Maximum likelihood estimation
Maximum likelihood estimation is performed using the the loglikelihood function corresponding to Eq. 4, namely,
where
and γ(·)= lnΓ(·). The loglikelihood function for the null hypothesis (\({\ln \mathcal {L}}_{0}\)) is given by Eq. 6 with ω=1.
The loglikelihood function \({\ln \mathcal {L}}_{1}(\mathcal {X})\) is maximized by numerically solving for \(\hat {\lambda }\) and \(\hat {\omega }\) leading to zero partial derivatives,
as described in Appendix 2. The loglikelihood function \({\ln \mathcal {L}}_{0}(\mathcal {X})\) is maximized by solving for \(\hat {\lambda }\) leading to
The optimal parameter values \(\hat {\lambda }\) and \(\hat {\omega }\) are used to evaluate \(\ln \hat {\mathcal {L}}_{0} (\mathcal {X} ; \hat {\lambda })\), \(\ln \hat {\mathcal {L}}_{1}(\mathcal {X} ; \hat {\lambda },\hat {\omega })\), and the test statistic W (see Eq. 1).
Quantifying changes in homeolog expression bias (ΔHEB)
Let A and B represent homeologous genes and RNAseq data is generated under conditions 1 and 2 in n biological replicates, leading to mean expression levels \({\bar {a}_{1}}, {\bar {a}_{2}}, {\bar {b}_{1}}, {\bar {b}_{2}}\). The change in homeolog expression bias (ΔHEB) is defined as
where the last equality uses \(\text {HEB}_{1} = \log {\bar {b}_{1}}/{\bar {a}_{1}}\) and \(\text {HEB}_{2}= \log {\bar {b}_{2}}/{\bar {a}_{2}}\).
Likelihood ratio test for ΔHEB
The likelihood ratio test for ΔHEB is designed to determine whether there is sufficient evidence to reject the null hypothesis (H_{0}) that homeolog expression bias is the same under two experimental conditions (ΔHEB=0) in favor of the alternative hypothesis (H_{1}) that there is a difference in bias (ΔHEB≠0). Following notation similar to the previous section, our hypotheses are
where \(\lambda ^{ab}_{12}\) is an abbreviation for \({{\lambda }^{a}_{1}},{{\lambda }^{b}_{1}}, {{\lambda }^{b}_{1}},{{\lambda }^{b}_{2}}\). Equivalently,
where \({{\omega }_{1}} = {{\lambda }^{b}_{1}}/{{\lambda }^{a}_{1}}, {{\omega }_{2}} = {{\lambda }^{b}_{2}}/{{\lambda }^{a}_{2}}, {{\lambda }_{1}} = {{\lambda }^{a}_{1}}\) and \({{\lambda }_{2}} = {{\lambda }^{a}_{2}}\). The difference in degrees of freedom of the alternative and null hypotheses is δ=4−3=1.
The likelihood functions for the ΔHEB test are similar to those for HEB, though the two different experimental conditions lead to twice as many terms (cf. Eq. 4). The likelihood function for H_{1} is
where \(\mathcal {L}_{1}^{k,i}\), the likelihood function for the ith replicate of the kth condition, has the form of Eq. 4 with parameters indexed by condition \(\left (a_{k,i}, b_{k,i}, r_{k,i}^{a}, r_{k,i}^{b}, {\omega }_{k}\right)\). The loglikelihood function for H_{1} is thus
where
and γ(·)= lnΓ(·). The loglikelihood function for the null hypothesis (\({\ln \mathcal {L}}_{0}\)) is given by the above expressions with ω_{1}=ω_{2}=ω. The aggregation parameters (r_{k,i}) are determined from the data with experimental conditions k=1 and 2 considered separately (cf. Eqs. 17–19).
The loglikelihood function \({\ln \mathcal {L}}_{1}(\mathcal {X})\) used in the analysis of ΔHEB is maximized by numerically solving uncoupled systems of the form of Eqs. 7 and 8 for \(\left (\hat {\lambda }_{1},\hat {\omega }_{1}\right)\) and \(\left (\hat {\lambda }_{2},\hat {\omega }_{2}\right)\). The loglikelihood function \({\ln \mathcal {L}}_{0}(\mathcal {X})\) is maximized by solving for \(\hat {\lambda }_{1}, \hat {\lambda }_{2}\) and \(\hat {\omega }\) that lead to zero partial derivatives,
The optimal parameter values are used to evaluate the likelihoods, \(\hat {\mathcal {L}}_{0} (\mathcal {X} ; \hat {\lambda }_{1},\hat {\lambda }_{2},\hat {\omega })\) and \(\hat {\mathcal {L}}_{1}(\mathcal {X} ; \hat {\lambda }_{1},\hat {\lambda }_{2},\hat {\omega }_{1}, \hat {\omega }_{2})\), and the test statistic W (see Eq. 1).
The numerical solution of these equations was facilitated by transforming these equations in a manner that ensured both parameters are positive and symmetric with respect to the mean expression levels of homeolog A and B (see Appendix 2).
Results
The likelihood ratio test for HEB applied to allotetraploid Mimulus luteus
To demonstrate the application of the likelihood ratio test for HEB, five biological replicates of RNAseq data were generated from petals of the tetraploid Mimulus luteus (monkeyflower), and another five replicates were generated from the leaves (see Appendix 3 for details). We have chosen M. luteus because it is a tetraploid with two distinct subgenomes, denoted A and B (mean synonymous divergence is ∼ 11.1%; for details on genome assembly, see [37]). In this section, we use the likelihood ratio test for HEB to find homeologous gene pairs where one homeolog is expressed at significantly different levels than the other, one tissue at a time. In the section on ΔHEB we develop a likelihood ratio test to determine whether there is a significant difference in the bias between the two tissues.
Homeolog expression bias in Mimulus luteus petals
Figure 1 (top panel) shows the result of applying the likelihood ratio test for HEB to the petal data. There are 1853 homeologous gene pairs in M. luteus that can be identified as coming from separate subgenomes. Of these 1853 homoeologous pairs, 1560 were testable (measurable expression from each individual homeolog). Of testable pairs, a total of 676 gene pairs show significant bias (using a significance level of α=0.05, and applying the BenjaminiHochberg correction [38, 39] to account for multiple testing error). In the 334 pairs biased towards the A homeolog the mean HEB is − 2.49 (5.6fold change). In the 342 pairs biased towards the B homeolog, the mean HEB is 2.39 (5.2fold change).
These results may be indicative of a number of evolutionary processes. For example, one of the homeologs may have become sub or neofunctionalized in this tissue, or one of the homeologs may simply be losing its function.
Homeolog expression bias in Mimulus luteus leaves
Next, the likelihood ratio test for HEB was applied to the leaf data (results shown in Fig 1, bottom panel). Of 1853 homoeologous pairs, 1498 were testable and a total of 399 gene pairs show significant bias. In the 199 pairs biased towards the A homeolog the mean HEB is − 2.83 (7.1fold change). In the 200 pairs biased towards the B homeolog, the mean HEB is 2.80 (7.0fold change).
The likelihood ratio test for ΔHEB applied to allotetraploid Mimulus luteus
The likelihood ratio test for ΔHEB requires each homeolog to have at least one read in each condition. Returning to the leaf and petal data from the previous sections on HEB, this gives 1448 testable pairs. Figure 2 shows the results of the likelihood ratio test for ΔHEB. We find a total of 76 gene pairs show significant ΔHEB. Of these, 35 are more biased towards the A homeolog in the leaf than they are in the petal. The remaining 41 gene pairs are more biased towards the B homeolog in the leaf than in the petal.
Figure 3 shows a scatter plot of homeolog expression bias (HEB) in leaf and petal. Colored marks indicate gene pairs with statistically significant changes in homeolog expression bias (ΔHEB) (these points correspond to the colored bars in Fig. 2). Data points in the topleft and bottomright quadrants of Fig. 3 represent homeologous pairs where one homeolog is more highly expressed in one tissue and its partner is more highly expressed in the other tissue. The topright and bottomleft quadrants correspond to homeologous pairs where the difference in bias favors the same homeolog but has become more extreme. Finally, all of the marks that are colored blue or yellow show significant change in bias and are candidates for tissue specific sub or neofunctionalization.
Although the change in homeolog expression bias is defined by Eq. 10 as the logfold change in homeolog expression bias, the intercalation of significant (yellow and blue) and not significant (gray) ΔHEB in Fig. 3 makes it clear that statistical evidence for ΔHEB is not reducible to the difference between HEB_{leaf} and HEB_{petal} (the vertical or horizontal distance to the line of slope 1 where HEB_{leaf}=HEB_{petal}).
Whether or not ΔHEB is statistically significant also depends on differences in sequencing depths, mean expression levels (e.g., lowly expressed genes are more likely to be influenced by shot noise), and ratios of gene lengths. All of these factors are considered simultaneously in the likelihood ratio test presented here. Assessing the statistical significance of ΔHEB using sequential HEB analysis would almost certainly result in a different set of genes being called significant.
Validation of the likelihood ratio tests using simulated data
A natural question to ask about HEB and ΔHEB is, “How large does the change in expression levels between homeologs across conditions need to be before we can detect ΔHEB most of the time?” Unsurprisingly, this depends largely on the number of biological replicates.
To explore this question, we generated simulated data with one expression level fixed at a constant value, μ^{a}=100, and varied the other expression level, μ^{b}=2^{x}μ^{a}, with x∈[0,2] in steps of 0.1. For each value of x, we generated 10,000 sets of data from a negative binomial distribution for N=3,6,12 and 24 replicates. We fixed the parameter r=10 for simplicity; this is within the range of values typically observed in RNAseq data.
Figure 4 shows the results of the likelihood ratio test for HEB on this simulated data set. We find that a 4fold change is almost always detectable, regardless of the number of replicates. However, detecting a 2fold change at least 95% of the time requires at least 12 replicates.
To assess the sensitivity of ΔHEB to different levels of bias shift, we created a similar data set. This time, we set 3 of the expression levels equal \(\left (\mu ^{a}_{1} = \mu ^{b}_{1} = \mu ^{a}_{2} =100\right)\), and varied the fourth; \(\mu ^{b}_{2} = 2^{x}\mu ^{a}_{2}\), with x∈[0,2] in steps of 0.1. The aggregation parameter was again fixed at r=10. For each value of x, 10,000 sets of data were generated from a negative binomial distribution for N=3,6,12 and 24 replicates.
Figure 5 shows the results of the likelihood ratio test for ΔHEB on this simulated data set. The results are similar to those for HEB, with the test for ΔHEB being slightly less sensitive than the test for HEB. For ΔHEB, a 4fold change in bias is detected more than 95% of the time when N≥6. As with the test for HEB, the ability to detect smaller changes increases significantly with the number of replicates.
Discussion
Alternative methods
Our method is transparent, derived specifically for the analysis of ΔHEB, and requires a minimal number of assumptions; nevertheless we wished to investigate whether other methods could achieve similar results. Because we found only one method analogous to ours in the literature (Homeoroq, [40]), we developed three additional ad hoc methods. To compare these methods we generated simulated data sets and analyzed ROC curves. Each data set contained 10,000 gene pairs, half of which had ΔHEB fixed at a constant value (2, 8, and 16). Three replicates were generated from negative binomial distributions, and this was repeated 50 times for each value of ΔHEB (150 simulations total).
First, we took a naive approach and performed ttests and ztests on the ratio of log_{2}fold changes between conditions 1 and 2. Next, we ran DESeq2 and extracted the estimated shrunken log_{2}fold changes and their standard errors, and performed a ztest (we call this method ‘DEZ’). Unsurprisingly, the naive methods (standard t and z tests) underperformed the LRT, with area under the ROC curve (ROC area) typically less than the LRT by ≈0.05 to 0.36.
The LRT outperformed DEZ for ΔHEB=8 and 16 (Fig. 7, top region). For ΔHEB=2, both methods performed poorly with mean ROC area =0.58080 for the LRT, while DEZ came out slightly ahead with mean ROC area =0.58083 (not shown). In all cases, the Homeoroq method significantly underperformed both alternatives.
The test with largest ROC area is not necessarily the best choice, for example, when an ROC curve accumulates a small area for low FPR, and a large area for high FPR. To address this, we evaluated partial ROC area for false positive rates between 0 and 0.1, as researchers typically don’t accept FPR > 0.1. An example truncated ROC curve is shown in Fig. 6. By this metric, the LRT outperforms DEZ for ΔHEB=8 and 16, while for ΔHEB=2 both methods performed poorly, with DEZ marginally better (Fig. 7, bottom region). Homeoroq significantly underperformed both alternatives at all levels of ΔHEB.
Conclusion
Gene duplication and polyploidy are extremely important factors in generating the diversity of life on earth. As Ohno stated in his seminal work on gene duplication [1], “Natural selection merely modified while redundancy created” the raw materials necessary for the diversification of life on earth.
In this paper we have developed a robust statistical framework specifically designed for the comparison of duplicate gene expression patterns. Importantly, this technique is consistent and reproducible. Through analysis of simulated data we have shown that these methods perform well, especially given the small sample sizes typical of RNAseq experiments. We have shown that the ability to detect small differences in expression levels increases as a function of sample size, a fact that can be used to aid experimental design. Other authors have noted this in the context of traditional differential expression analysis and made similar recommendations [41–43]. Moreover, we demonstrate the usefulness of the likelihood ratio test for ΔHEB using homeolog expression RNAseq data derived from a polyploid plant. While we have developed this test for the purpose of analyzing changes in expression patterns of homeologous genes, we emphasize that the method is suitable for the expression analysis of any two genes (they need not be homeologs) across any two conditions.
Appendix 1: Estimation of aggregation parameters
Due to the typically small number of replicates in RNAseq experiments, accurate estimation of the aggregation parameter is not realistic on a genebygene basis [34, 44]. Instead, we use the meanvariance relation of a negative binomial distribution, namely,
to compute an aggregation parameter r for each experimental replicate, after rescaling to account for each replicates sequencing depth.
In brief, let \(x_{j}^{i}\) denote the count data for the jth pair of homeologous genes obtained for experimental replicate i∈{1,2,…,n}. For each of the n replicates, we produce an auxiliary data set \(\left (y_{k,j}^{i}\right)\) by rescaling the count data for all replicates as though each were obtained in an experiment with the sequencing depth of replicate k,
For each gene (j), we compute a scaled mean (μ_{k,j}) and variance \(\left (\sigma ^{2}_{k,j}\right)\) of \(y_{k,j}^{i}\) over replicates (i). To obtain the aggregation parameter r_{k}, we perform a nonlinear least squares fit of the observed meanvariance relation across all genes. That is, r_{k} minimizes the sum of squares error,
Appendix 2: Numerical scheme for maximum likelihood estimation
For the analysis of both HEB and ΔHEB, parameter values maximizing the likelihood functions \(\hat {\mathcal {L}}_{0}\) and \(\hat {\mathcal {L}}_{1}\) were obtained using the builtin MATLAB command fsolve applied to Eqs. 7–9 and 14–16. In both cases, the numerical procedure was facilitated by changing variables from (λ,ω) to (v,y) through
that is, v= lnλ+y and y=(lnω)/2. This ensures positivity of λ and ω and leads to a system of equations that is symmetric in λ^{a}⇔λ^{b}. The new variable v is the logarithm of the geometric mean of the expression levels λ^{a}=λ and λ^{b}=ωλ,
that is, λ^{a}=λ=e^{v−y} and λ^{b}=ωλ=e^{v+y}. The transformed partial derivatives used to maximize the loglikelihood \({\ln \mathcal {L}}_{1}\) (Eqs. 7–8) are
where
The transformed partial derivative used to maximize \({\ln \mathcal {L}}_{0}\) are found by substituting y=0 in Eq. 20,
For the analysis of ΔHEB, the partial derivatives used to maximize \({\ln \mathcal {L}}_{1}\) are two uncoupled systems of the form of Eq. 20–23, one for each experimental condition (k=1 and 2),
where
For the null hypothesis y_{2}=y_{1}=y we numerically solve a system of three equations, including
for k=1 and 2. These are coupled via
Appendix 3: Experimental methods
Plant tissues were collected from second generation inbred Mimulus luteus. All plants were grown in a greenhouse under a 16 h light regiment at 21°C and 30% humidity. Petal tissue was collected from the corolla of a flower bud near blooming, and leaf tissue came from young leaves adjacent to the stem apical meristem. Five replicates of each tissue type were collected, at the same time of day, from different individuals. Approximately 100–200 mg of plant tissue was immediately placed into liquid nitrogen. RNA was extracted by grinding frozen tissue with pestles in PureLink Plant RNA Reagent from Ambion. Column isolation of RNA was subsequently performed using Directzol RNA MiniPrep Plus Kit from Zymo Research. Libraries were constructed using KAPA Stranded mRNASeq Kit. During library construction, sequence specific Illumina TruSeq ÂⓇ adapters were added to distinguish each library. Using an Agilent 2100 Bioanalyzer, average fragment lengths were determined to be between 230 and 300 bp. Libraries were then pooled and sequenced by the Duke Center for Genomic and Computational Biology on an Illumina HiSeq 2500 instrument. The resulting reads (50 base pair, single end) were mapped to the M. luteus genome using bowtie2 [45] with the –verysensitivelocal option. Reads to exonic regions were counted using htseqcount [46] with the default settings (minimum alignment quality of 10 on the phred scale).
Abbreviations
 FPR:

False positive rate
 HEB:

Homeolog expression bias
 LRT:

Likelihood ratio test
 ROC:

Receiver operating characteristic
 WGD:

Whole genome duplication
References
 1
Ohno S. Evolution by Gene Duplication. New York: Springer; 1970.
 2
Otto SP, Whitton J. Polyploid incidence and evolution. Annu Rev Genet. 2000; 34:401–37.
 3
Wendel JF. Genome evolution in polyploids. Plant Mol Biol. 2000; 42:225–49.
 4
Crow KD, Wagner GP. What is the role of genome duplication in the evolution of complexity and diversity?. Mol Biol Evol. 2006; 23(5):887–92.
 5
Proulx SR. Multiple routes to subfunctionalization and gene duplicate specialization. Genetics. 2012; 190:737–51.
 6
McLysaght A, Hokamp K, Wolfe KH. Extensive genomic duplication during early chordate evolution. Nat Genet. 2002; 31(2):200–4.
 7
Dehal P, Boore JL. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005; 3(10):314.
 8
Spring J. Genome duplication strikes back. Nat Genet. 2002; 31(2):128–30.
 9
Chao DY, Dilkes B, Luo H, Douglas A, Yakubova E, Lahner B, Salt DE. Polyploids exhibit higher potassium uptake and salinity tolerance in arabidopsis. Science. 2013; 341(6146):658–9.
 10
Sémon M, Wolfe KH. Preferential subfunctionalization of slowevolving genes after allopolyploidization in Xenopus laevis. Proc Natl Acad Sci. 2008; 105(24):8333–8.
 11
Rastogi S, Liberles DA. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol. 2005; 5:28.
 12
Taylor JS, Raes J. Duplication and divergence: The evolution of new genes and old ideas. Annu Rev Genet. 2004; 38:615–43.
 13
Smet RD, de Peer YV. Redundancy and rewiring of genetic networks following genomewide duplication events. Curr Opin Plant Biol. 2012; 15:168–76.
 14
Blanc G, Wolfe KH. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. The Plant Cell. 2004; 16(7):1679–91.
 15
Kassahn KS, Dang VT, Wilkins SJ, Perkins AC, Ragan MA. Evolution of gene function and regulatory control after wholegenome duplication: Comparative analyses in vertebrates. Genome Res. 2009; 19(8):1404–18.
 16
Huminiecki L, Wolfe KH. Divergence of spatial gene expression profiles following speciesspecific gene duplications in human and mouse. Genome Res. 2004; 14(10a):1870–9.
 17
Makova KD, Li WH. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 2003; 13(7):1638–45.
 18
Gu Z, Nicolae D, Lu HH, Li WH. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 2002; 18(12):609–13.
 19
Qiu Y, Liu SL, Adams KL. Frequent changes in expression profile and accelerated sequence evolution of duplicated imprinted genes in Arabidopsis. Genome Biol Evol. 2014; 6(7):1830–42.
 20
Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473(7345):97–100.
 21
Wood TE, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, Rieseberg LH. The frequency of polyploid speciation in vascular plants. Proc Natl Acad Sci. 2009; 106(33):13875–9.
 22
RennyByfield S, Wendel JF. Doubling down on genomes: Polyploidy and crop plants. Am J Bot. 2014; 101(10):1711–25.
 23
Kobel HR, Pasquier LD. Genetics of polyploid Xenopus. Trends Genet. 1986; 2:310–5.
 24
Wolfe KH. Yesterday’s polyploids and the mystery of diploidization. Nat Rev Genet. 2001; 2:333–41.
 25
Albertin W, Marullo P. Polyploidy in fungi: evolution after wholegenome duplication. Proc R Soc B. 2012; 279:2497–509.
 26
Gallardo MH, González CA, Cebrián I. Molecular cytogenetics and allotetraploidy in the red vizcacha rat Tympanoctomys barrerae (Rodentia, Octodontidae). Genomics. 2006; 88:214–21.
 27
Rivera AS, Pankey MS, Plachetzki DC, Villacorta C, Syme AE, Serb JM, Omilian AR, Oakley TH. Gene duplication and the origins of morphological complexity in pancrustacean eyes, a genomic approach. BMC Evol Biol. 2010; 10(1):123.
 28
Hardison RC. Evolution of hemoglobin and its genes. Cold Spring Harb Perspect Med. 2012; 2(12):011627.
 29
Hu C, Lin Sy, Chi Wt, Charng Yy. Recent gene duplication and subfunctionalization produced a mitochondrial GrpE, the nucleotide exchange factor of the Hsp70 complex, specialized in thermotolerance to chronic heat stress in arabidopsis. Plant Physiol. 2012; 158(2):747–58.
 30
Remnant EJ, Good RT, Schmidt JM, Lumb C, Robin C, Daborn PJ, Batterham P. Gene duplication in the major insecticide target site, rdl, in drosophila melanogaster. Proc Natl Acad Sci. 2013; 110(36):14705–10.
 31
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10:57–63.
 32
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14(91).
 33
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014; 15(12):550.
 34
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
 35
Grover CE, Gallagher JP, Szadkowski EP, Yoo MJ, Flagel LE, Wendel JF. Homoeolog expression bias and expression level dominance in allopolyploids. New Phytol. 2012; 196:966–71.
 36
Wilks SS. The large sample distribution of the likelihood ratio test for testing composite hypotheses. Ann Math Statist. 1938; 9(1):60–62.
 37
Edger PP, Smith RD, McKain MR, Cooley AM, VallejoMarin M, Yuan Y, Bewick AJ, Ji L, Platts AE, Bowman MJ, et al. Subgenome dominance in an interspecific hybrid, synthetic allopolyploid, and a 140 year old naturally established neoallopolyploid monkeyflower. The Plant Cell. 2017; 29(9):2150–67.
 38
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57:289–300.
 39
Groppe DM. Benjamini & Hochberg/Yekutieli false discovery rate control procedure for a set of statistical tests. 2015. https://www.mathworks.com/matlabcentral/fileexchange/27418fdrbh. Accessed 6 Dec 2017.
 40
Akama Satoru, ShimizuInatsugi R, Shimizu KK, Sese J. Genomewide quantification of homeolog expression ratio revealed nonstochastic gene regulation in synthetic allopolyploid arabidopsis. Nucleic Acids Res. 2014; 42(6):e46. Oxford University Press.
 41
Liu Y, Zhou J, White KP. RNAseq differential expression studies: more sequence or more replication?. Bioinformatics. 2013; 30(3):301–4.
 42
Schurch NJ, Schofield P, Gierlinski M, Cole C, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson GG, OwenHughes T, Blaxter M, Barton GJ. How many biological replicates are needed in an RNAseq experiment and which differential expression tool should you use?. RNA. 2016; 22(6):839–51.
 43
Roulin A, Auer PL, Libault M, Schlueter J, Farmer A, May G, Stacey G, Doerge RW, Jackson SA. The fate of duplicated genes in a polyploid plant genome. The Plant J. 2013; 73(1):143–53.
 44
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008; 9(2):321–32.
 45
Langmead B, Salzberg SL. Fast gappedread alignment with Bowtie 2. Nat Methods. 2012; 9(4):357–9.
 46
Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with highthroughput sequencing data. Bioinformatics. 2014; 31:638.
Acknowledgements
The work was supported in part by National Science Foundation Grant DMS 1121606 to GDCS, the Biomathematics Initiative at The College of William & Mary, and a W&M Summer Research grant to JRP.
Availability of data and materials
Sample MATLAB code and the data used in ΔHEB analysis can be found on the Mathworks file exchange, submission number 62502: https://www.mathworks.com/matlabcentral/fileexchange/62502. Raw sequence reads are available on the NCBI SRA under accession number PRJNA380107: https://www.ncbi.nlm.nih.gov//bioproject/PRJNA380107.;
Author information
Affiliations
Contributions
All four coauthors were involved in conception of the work; data analysis and interpretation; and drafting, critical review and final approval of the article. Data collection (TJK and JRP). Mathematical framework and numerical scheme (RDS and GDCS). TJK supervised by JRP. RDS was jointly supervised by JRP and GDCS. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Joshua R. Puzey.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Smith, R., Kinser, T., Smith, G.C. et al. A likelihood ratio test for changes in homeolog expression bias. BMC Bioinformatics 20, 149 (2019). https://doi.org/10.1186/s1285901927095
Received:
Accepted:
Published:
Keywords
 Homeolog expression bias
 Likelihood ratio test
 Polyploid
 RNAseq
 Whole genome duplication