Generalized shrinkage F-like statistics for testing an interaction term in gene expression analysis in the presence of heteroscedasticity

Background Many analyses of gene expression data involve hypothesis tests of an interaction term between two fixed effects, typically tested using a residual variance. In expression studies, the issue of variance heteroscedasticity has received much attention, and previous work has focused on either between-gene or within-gene heteroscedasticity. However, in a single experiment, heteroscedasticity may exist both within and between genes. Here we develop flexible shrinkage error estimators considering both between-gene and within-gene heteroscedasticity and use them to construct F-like test statistics for testing interactions, with cutoff values obtained by permutation. These permutation tests are complicated, and several permutation tests are investigated here. Results Our proposed test statistics are compared with other existing shrinkage-type test statistics through extensive simulation studies and a real data example. The results show that the choice of permutation procedures has dramatically more influence on detection power than the choice of F or F-like test statistics. When both types of gene heteroscedasticity exist, our proposed test statistics can control preselected type-I errors and are more powerful. Raw data permutation is not valid in this setting. Whether unrestricted or restricted residual permutation should be used depends on the specific type of test statistic. Conclusions The F-like test statistic that uses the proposed flexible shrinkage error estimator considering both types of gene heteroscedasticity and unrestricted residual permutation can provide a statistically valid and powerful test. Therefore, we recommended that it should always applied in the analysis of real gene expression data analysis to test an interaction term.


Background
The regulation of gene expression starts when a cell's DNA is transcribed into mRNA. The simultaneous expression profiles of many genes under different circumstances can provide insight into physiological processes. Using modern technologies in gene expression experiments such as oligonucleotide arrays [1], and cDNA spotted arrays [2], many scientists have made novel discoveries about complex biological processes of yeast [3,4], drosophila [5], mice [6], humans [7], and other species. Recently one such study also included RNA-seq [8]. Statistical methodologies and issues involved in microarray data analysis have been widely reviewed [9][10][11][12], and it is expected that many of the same issues will need to be addressed with RNA-seq.
The analysis of variance (ANOVA) model is a popular statistical modeling method for the analysis of microarrays. Since its introduction by Kerr et al. [13], it has been extensively examined for use in this setting [14][15][16][17][18][19][20][21]. Kerr et al. constructed an ANOVA model that included the gene effect as a fixed effect. This model assumes identically and independently distributed residual errors across genes. The advantage of this model is that the large number of genes involved in a microarray experiment results in huge degrees of freedom for the error estimate, which can lead to a very powerful test. However, the common assumption of homoscedasticity may not hold true in this setting [22]. One alternative is to use an ANOVA model for each gene, but the resulting test statistics from gene-specific models may have limited power because the biological sample size for each gene in a microarray experiment is usually small.
To address this problem of limited power, researchers have proposed other methods for obtaining more information across genes, ranging from a simple equalweighted average of a gene-specific error estimate and the global average of all gene-specific error estimates (F 2 statistic proposed by Wu et al. [19] to empirical Bayesian modeling of all gene-specific errors [23][24][25][26]. Other variations [27][28][29] used different variance modeling strategies to address the heteroscedasticity problem, but no clear winner has emerged [30]. Huang and Liu [31] extended the test statistics proposed by Cui et al. [28] by assuming a normal distribution on the mean and then deriving an empirical Bayes likelihood ratio test. The resulting test statistic shrinks both the mean and variances.
In addition to the problem of between-gene heteroscedasticity, we must also be concerned with within-gene heteroscedasticity. For example, in the study of simple differential gene expression between a treatment group and a control group, the variance in the treatment arm may differ from that in the control arm. Some approaches to this problem include a general Bayesian framework to model heteroscedastic error in a single generalized linear mixed model setting [32] and a structural model placed on the error variances specific to each gene and treatment combination [33].
As gene expression studies become more popular, the complexity of the experiment increases. Instead of only simple treatment and control experiments, two or more factor experiments are being conducted. This increase in experiment complexity has led to many scientific questions involving the hypothesis testing of an interaction between two factors. For example, testing a probe by genotype interaction can result in inferences about polymorphism in the probe, such as single nucleotide polymorphism (SNP) and insertion-deletion (indel) [34][35][36][37]; testing a probe by sex can imply that alternative splicing occurs between male and female subjects [38]; and in pharmacogenomic studies, testing the genotypedrug/treatment or genotype-disease interaction may be of interest [39]. Thus far, all the development of ANOVA methods for microarray studies has focused on tests of main effects.
Here, a generalized shrinkage estimator incorporating both within-and between-gene heteroscedasticities is developed (see Lehmann and Cesella [40] for a review of shrinkage estimation). In any given experiment, both within-gene and between-gene heteroscedasticity may exist; thus, taking these possibilities into account should lead to an improved test statistic. Moreover, given the increasing complexity of recent studies and the burgeoning interest in hypotheses that involve interactions, we focus on an improved shrinkage-based F-test for interaction terms.

Methods
Here we develop new shrinkage estimates for the error term and show how to use these estimates to construct F-like statistics. We then estimate the null distribution of these statistics by using permutation tests.

Shrinkage error estimators
Shrinkage error estimators pull individual error estimates toward shrinkage targets, with the amount of shrinkage depending on the variability of individual error estimates [28,40]. Let the gene-specific error estimates for all genes i and subgroups k be σ 2 1,1 , ...,σ 2 1,K , ...,σ 2 I,K , i = 1,...,I, k = 1,..., K, and let σ 2 i,k be the true variance of gene i in group k. When the experimental design is balanced,σ 2 i,k is the residual mean square for gene i in group k and νσ 2 i,k /σ 2 i,k ∼ χ 2 ν , where ν represents the degrees of freedom for the error estimates.
The choices of shrinkage targets in microarray data include the following: 1. Specific values for each gene-group combination 2. Gene-specific values that are the same across all other groups 3. Group-specific values that are the same across genes but different across groups 4. A single point representing the underlying common error Correspondingly, these targets are correct when (1) there are both within-gene and between-gene heteroscedasticity; (2) there is only between-gene heteroscedasticity; (3) there is only within-gene heteroscedasticity; and (4) all error variances are identical. We now develop a generalized shrinkage error estimator using these four shrinkage targets. Let where m is the mean of log log χ 2 ν /ν. Then using asymptotic normal approximation of X i,k , the distribution of X i,k s with different shrinkage targets for different gene i and group k combinations is whereθ = θ 1,1 , ..., θ 1,K, ..., θ I,1 , ..., θ I,K ,α = (α 1 , ..., α I ) represents the gene-specific mean differences, and β = (β 1 , ..., β K ) models different means with respect to different classes of the subgroups.
If s 2 and τ 2 are known, then the Bayes estimator of θ i, k under the squared error loss is [39]: Here, s 2 is the variance of log χ 2 ν /ν and is known [28,40], but τ 2 is not known. However, the marginal distribution of X i,k can be used to create an empirical Bayes estimator of τ 2 and hence of θ i,k . Marginally, X i,k N(μ + a i + b k , s 2 + τ 2 ),i = 1,..., I, k = 1, ...K, and, from this model, the least square estimates of μ,α,β,μ,α,β , are the uniformly minimum-variance and unbiased estimators. Using the fact that Then, we can construct the positive-part empirical Bayes estimator [40]: where(x) + = max(x, 0). The generalized shrinkage error estimate for s i,k can be obtained through exponentiating θ EB+ i,k as follows: Using a similar argument, the generalized shrinkage error estimator with the shrinkage target at each gene isσ with the shrinkage target at each group is and with the shrinkage target at the common error, we haveσ The shrinkage error estimator proposed by Cui et al. [28] shrinks the gene-specific error estimators toward their common corrected geometric mean. Specifically, the estimator for σ 2 i is calculated as where X i is the residual variance estimate from a gene-specific model, and m and s 2 are the mean and variance of log The underlying assumption for this estimator is that there is no between-gene heteroscedasticity, as this estimator shrinks every gene-specific error estimator toward one target. Therefore, it will overshrink the gene-specific error estimates when gene heteroscedasticity exists. In comparison, generalized shrinkage error estimators are flexible in terms of incorporating a different type of heteroscedasticity. Some degrees of freedom are used for incorporating the heteroscedasticity. However, the gain is that the error estimator is then closer to the underlying distribution and should lead to better performance of the resultant F-like test statistics as shown in the results section.
In formulas (2), (3), (5), and (6), m is the mean and s 2 is the variance of a log-transformed chi-square random variable. The simulation-based approximate values of m and s 2 can be found from Table 1 in work of Cui et al. [28]. Pounds [41] gave analytical expressions for these parameters and developed R code for the exact calculation. Here, the simulation-based approximate values were used.

Shrinkage F-like statistics
To construct a statistic for the hypothesis test of no interaction between two fixed effects, the traditional Ftest is simply the ratio of the mean square of the interaction term (MSI) and the mean square of residuals (MSE). This F-test, referred to as F 1 [42], is The F 1 test corresponding to a specific gene i is denoted by The error variance estimator in this test uses data from only gene i. In oligonucleotide mi-croarray models, the degrees of freedom for the error estimate can be small because the sample size of RNA is usually small, and hence the power of F 1 can be limited.
Following the method of constructing an F-test statistic given by Neter et al. [42], the gene-specific shrinkage F-like statistics for testing an interaction between two fixed effects can be obtained as When the homoscedastic error assumption is true, the pooled variance estimator,σ 2 pool , can be used to construct an F-like statistic. For a balanced design, the pooled variance estimate is the average of all gene-specific error estimates. This statistic is denoted by F 3 using the same notation used by Cui and Churchill [22], who also introduced another shrinkage-type F statistic, F 2 , which can also borrow information across genes when estimating the residual variances. The statistic F 2 uses an equalweighted average of a gene-specific error estimatorσ 2 andσ 2 pool . The definitions of F 2,i and F 3,i are

Permutation tests
For the proposed generalized shrinkage F-like test statistics, the null distributions are not known named distributions. Therefore, an empirical approach such as a permutation test can be used to estimate the null distributions. The permutation test for interaction is complicated, because there is no exact permutation test for such a purpose [43]. We therefore must consider an approximate permutation method for testing an interaction term in a crossed fixed/mixed model [44,45]. Permutation approaches developed previously focused on a single ANOVA model. In the typical gene expression study, thousands of ANOVA models are considered simultaneously. The additional complexity of the shrinkage F-like statistics indicates that Monte Carlo studies are needed to investigate the performance of residual permutation and raw data permutation, with restrictions or not, in a gene-expression analysis. The choice of permutation procedures is critical for assessing the performance of a test statistic.
For all the modified F-like statistics presented in the previous section, the null distributions can only be approximated empirically, but permutation procedures can be used to find the approximate null distribution of all the F and F-like statistics. The important issues in performing a permutation analysis include the choice of the exchangeable units under the null hypothesis, the choice of using restricted permutation or not, and the choice of residual permutation or raw data permutation. These choices influence the power of a test statistic.
Residual permutation using residuals from a reduced model and unrestricted raw data permutation can be used to approximate the null distribution of a statistic for testing an interaction term [44]. When using F 1 to test an interaction term in a single ANOVA model, the residual permutation leads to a more powerful test than unrestricted raw data permutation [44]. However, in gene expression analysis, thousands of gene-specific ANOVA models are simultaneously considered, and for a particular gene-specific ANOVA model, information from other gene-specific ANOVA models is used to construct the shrinkage error estimate. Hence, both residual permutation and raw data permutation were investigated. Furthermore, both restricted and unrestricted permutations were studied, because the permutation units are exchangeable only within each particular group when within-gene heteroscedasticity is present across those subgroups.

Results
The properties of this shrinkage estimator are compared with those of other existing F and F-like statistics that have been proposed and described in the "Shrinkage Flike statistics" section.

Simulation studies
The purpose of these simulation studies was to compare the performances of F 1 , F 2 , F 3 , F Cui , F Gen , F Gen-gene , and F Gen-grp in terms of type I error and power and to compare the results of a particular F-like statistic using four different permutation strategies: restricted/unrestricted residual permutation and restricted/unrestricted raw data permutation. In these simulation studies, 100 genes with two probes for each gene and three replicates from each of two lines were simulated to mimic a split-plot design in a general oligonu-cleotide microarray experiment. The gene-specific ANOVA model in which data were generated from the model, y plr = P p + L l + RL rl + PL pl + plr , wp = 1, 2, l = 1,2, r = 1,2,3, where P, L, RL, and PL represent probe, line, replicates from a particular line, and the interaction between probe and line, respectively.
Replicates were nested within each line, and RL is usually treated as a random effect during the model-fitting procedure, which results in a correlation between probes from the same biological sample. In the simulated data sets, the correlation between genes was 0. As many as 900 simulation runs were carried out to compare the performances of F 1 , F 2 , F 3 , F Cui , F Gen , F Gen-gene , and F Gen-grp based on different permutation procedures. The four permutations tested were unrestricted residual permutation, restricted residual permutation with respect to each line, unrestricted raw data permutation, and restricted raw data permutation with respect to each line. The residuals permuted were from a reduced fixed model with fixed effects for only line and probe.
Two types of data were simulated: null cases and cases with a probe by line interaction at a range of degrees. Null cases included: null-ce, all probe-level expression values were simulated from the standard normal distribution; null-gh, the gene-specific error variances were simulated from the log-normal distribution with mean log at 0 and standard deviation at 2, mimicking the general heteroscedastic error distribution in typical datasets; null-wgh, all genes had the same error structures and the residual error variance of line 1 was 100 times that of line 2; null-bgh, simulated data were modified from null-gh, with the variance of line 1 multiplied by 100. Correspondingly, ce, gh, wgh, and bgh in Figures 1  and 2 were simulated by adding interaction terms to null-ce, null-gh, null-wgh, and null-bgh. Quantitative interaction was assumed and the differences in the opposite direction were set to make the detection powers for an interaction term based on traditional Fstatistics and tabled p-values range from 0.05 to 0.95. Tables 1 and 2 show the results from 900 simulation runs using raw data permutation and residual permutation, respectively. Data in Table 1 suggest that when both types of gene heteroscedasticity exist, the unrestricted raw data permutation had a greater average comparison-wise error rate (CWER) than residual permutation. Raw data permutation with restriction can control prespecified CWER im all cases. In Table 2, for the common error cases, all test statistics had the prespecified CWER from both restricted and unrestricted residual permutation. When within-gene heteroscedasticity existed, F 1 and F Cui had inflated CWER from both two residual permutation tests. Restricted residual permutation reduces, but does not solve, this problem. For F 2 and F 3 , only the restricted residual permutation could control the prespecified CWER. For F Gen , F Gen-gene , and F Gen-grp , restricted residual permutation gave conservative results in terms of having CWER smaller than the prespecified level. When the shrinkage target is correctly set, unrestricted residual permutation controls the nominal CWER. As expected, only F Gen coupled with unrestricted residual permutation could be used for all cases, because the CWER was always less than the nominal level.
Further simulations to compare the rejection rates were conducted. Only results from residual permutation are shown because it was found that raw data permutation was less powerful than residual permutation. This is consistent with the findings of Anderson and Ter Braak [44]. Figure 1 shows the estimated average null hypothesis rejection rate curves from all F-like statistics and both restricted and unrestricted residual permutation procedures. The x-axis represents the average null hypothesis rejection rate using F 1 and the tabulated pvalues. The solid line shows that the corresponding statistic controls the prespecified CWER, and the dashed line shows that the corresponding CWER was inflated. In general, restricted residual permutation is less powerful than unrestricted residual permutation. For example, the power of all statistics from unrestricted residual permutation almost doubled in some cases where heteroscedasticity existed.
When the common error assumption is valid, F 3 is obviously the most powerful test and the prespecified CWER is controlled. All other F-like statistics performed very similarly in this case. When the shrinkage target was correctly set, the resultant test statistic was the most powerful one. For example, when there was only within-gene heteroscedasticity, F Gen-grp was more powerful than F Gen and F Gen-gene based on either restricted or unrestricted residual permutation. The rejection rate comparison of statistically valid test statistics is further illustrated in Figure 2, where the x-axis is the average rejection rate from using F Gen and unrestricted residual permutation. Figure 2 clearly shows that unrestricted residual permutation is more favorable in terms of power. F Gen-grp appears to be more powerful than F Gen , but when both types of gene heteroscedasticities occur, F Gen grp has inflated CWER.

Drosophila data
The data used in this study are from a gene expression comparison study between D. melanogaster and D. simulans [46]. Expression of 10 genotypes of each species was measured in male flies. In D. simulans, each genotype was measured separately, and in D. melanogaster, a pool of 10 genotypes was measured. All genotypes (individual or pooled) were independently isolated and hybridized three times. The goal of the original study was to provide a genome-wide approach to identifying candidate genes potentially responsible for adaptation and speciation in D. simulans and D. melanogaster. In this study, we focus on Figure 1 The comparison of power curves of all F-like test statistics. The x-axis is the average power from analyzing 900 simulated data sets using F 1 with tabled p-values. The y-axis is the estimated powers using empirical gene-specific null distributions from 1,000 residual permutations. The upper four plots show the results with restricted residual permutation, while the lower four plots show the results from unrestricted residual permutation. The solid line indicates the empirical average CWER of a statistic is at the prespecified level, and the dashed line shows an inflated empirical average CWER."ce," all genes have common error; "gh," only between-gene heteroscedasticity exists; "wgh," only within-gene heteroscedasticity exists; "bgh," both between-gene and within-gene heteroscedasticity exist.
identifying sequence differences between genotypes in D. simulans based on hybridization profiles. Within-gene heteroscedasticity is expected because the genotypes come from different lines. The proposed generalized shrinkage F-like test statistics F Gen , F Gen-gene , and F Gen-grp were compared with F 2 , F 3 with restricted residual permutation, which could control prespecified CWER for any variance structure in simulation studies. Furthermore, Smyth's moderated F-test statistic [25] without multiple testing adjustment and controlling the false discovery rate (FDR) at 5% were used for comparison. As the main interest is in sequence difference, the focus is on the test of interaction between line and probe. The split plot model described above is used. SAS program codes are included in the additional files (additional file 1 and additional file 2).
The Drosophila genome has been fully sequenced and both SNPs and indels can cause a significant interaction term. Thus, the false positive rate and detection power Figure 2 The comparison of power curves of F Gen from unrestricted residual permutation versus other F-like test statistics. Only results from permutation combinations that can control prespecified CWER are used in this figure. The x-axis is the average power after analyzing 900 simulated data sets using F Gen and 1,000 unrestricted residual permutations. The y-axis is the estimated power from other F-like test statistics and empirical gene-specific null distributions based on the appropriate permutation. The solid black line corresponds to F Gen with unrestricted permutation, and this test always controls prespecified CWER."ce," all genes have common error; "gh," only between-gene heteroscedasticity exists; "wgh," only within-gene heteroscedasticity exists; "bgh," both between-gene and within gene heteroscedasticity exist; "res," restricted permutation; "unres," unrestricted permuation. based on SNP/indel sequence information can be calculated for a subset of the data. In the data set, there were 10 lines from D. simulans and three replicates from each line. Each probe set had 14 probes. The 1,285 probesets containing all "good" probes were selected. A "bad" probe's sequence satisfies one or more of the following criteria: it matches the D. simulans genome multiple times; it cannot be mapped to the flybase 4.2.1 genome; or, it has no information, such as hitting outside an exon, hitting a poorly aligned region, or hitting a region lacking a sequence. SNP or indel information could be determined in 777 probesets. For this data set, there was a high degree of within-gene heteroscedasticity: about 22.3% of the probe sets had a difference in line-specific residual variance estimates as large as or more than a 10-fold change. Therefore, as suggested by the conclusions from simulation studies, unrestricted residual permutation and restricted residual permutation were used for generalized shrinkage F-like test statistics (F Gen , F Gen-gene , F Gen-grp ) and restricted residual permutation was used for statistics (F 2 , F 3 ). The results are shown in Table 3. Consistent with the findings from the simulation studies, F Gen had about 30% more detecting power by valuing the within-gene heteroscedasticity than the other F-like test statistics (F 2 , F 3 ). The false discovery rate of F Gen was slightly higher than that of F 2 , F 3 . F Gen-gene and F Gen-grp performed similarly to F Gen . Both of Smyth's moderated F-test statistic without multiple testing adjustment and with FDR set at 5% for multiple testing adjustment detected more SNPs and indels but at the expense of a greater FDR than F Gen .

Discussion
For gene expression analysis, ANOVA models have been a popular modeling technique. Based on ANOVA models, flexible shrinkage F-like test statistics were developed to account for both the within-gene and betweengene heteroscedasticities. The emphasis here is on testing an interaction term, as this case is of increasing interest to biologists, and there is no clear existing theory on the most powerful, valid approach for such statistics. For all F-like statistics studied here, their null distributions were approximated empirically through permutations. Four different permutation procedures were investigated for eight different F-like statistical tests of the interaction term.
As expected, we found that when an error estimator overshrinks, the resulting F-like statistic cannot control the prespecified CWER. For example, F Gen-gene is an over-shrinkage error estimator when there is withingene heteroscedasticity. As a result, compared with generalized shrinkage F-like statistics, it is not valid when within-gene heteroscedasticity exists. Undershrinkage is also important, as it will lead to a conservative test and lower power. This is clearly demonstrated when the common error can be assumed and the most powerful valid test is F Gen-grp .
The most striking result was the impact of the permutation procedures. Although this was not completely unexpected [43][44][45], the effect of the permutation procedures is dramatic and worthy of special attention. Unrestricted raw data permutation could not control prespeci-fied CWER when there was within-gene heteroscedasticity. Restricted raw data permutation could be used, but it was less powerful than residual permutation. Also consistent with findings from Anderson and Ter Braak [44], restricted permutations are less powerful than unrestricted permutations. However, unrestricted permutations are valid only for a common error and when between-gene heteroscedasticity exists for our proposed shrinkage statistics; they are not valid in combination with F 2 , F 3 , or F Cui . For F Gen-grp , the unrestricted permutation can also be used in cases having within-gene heteroscedasticity, while only F Gen is valid with unrestricted permutation in all cases in terms of controlling prespecified CWER. Interestingly, The CWER was set to 0.05. Gene-specific cutoff values were obtained from 1,000 permutations. "moderated F-1" and "moderated F-2" represent results from using moderated F statistic without any multiple testing adjustment and setting FDR to 5%. the power gain from using the correct shrinkage target F Gen-grp rather than F Gen is far less than that of using unrestricted permutation. The result is that F 3 is never the most powerful choice when testing an interaction term.
The correct shrinkage target can lead to the most powerful test statistic. As one of the reviewers suggested, a statistical test may be applied to help pick the best shrinkage target before obtaining shrinkage error estimates. However, this extra testing step may inflate the CWER of the test statistic when there is gene heteroscedasticity. For example, when there are both types of gene heteroscedasticities, it is possible that the above test suggests only within-gene heteroscedasticities exist, and F Gen-grp is shown to inflate the CWER. There is minimal penalty to using the shrinkage estimator we propose, so we recommend setting the shrinkage target in the full space spanned by group and gene and using unrestricted permutation to compensate for the possible power loss in fewer degrees of freedom left for estimating the errors.

Conclusions
The proposed generalized shrinkage F-like statistic with shrinkage targets located in a space spanned by gene and another group, F Gen , with unrestricted residual permutation is always valid in terms of having a prespecified CWER. This statistic has reasonable power in most cases; thus, it is generally recommended to be applied to test an interaction term in the analysis of real gene expression data.

Additional material
Additional file 1: SAS program code 1. SAS program code for analyzing the real data set using residual permutation without restriction.
Additional file 2: SAS program code 2. SAS program code for analyzing the real data set using residual permutation with restriction.
List of abbreviations CWER: comparison-wise error rate; FDR: false discovery rate; indel: insertion and deletion; SNP: single nucleotide polymorphism;